[autoscaler] Add documentation for multi node type autoscaling #10405
Merged
Commits (14):
- 9a6b709 update (ericl)
- cd18999 update (ericl)
- d7b38c8 fix (ericl)
- 03b235a fix (ericl)
- 6b1dfd6 update (ericl)
- b6ca886 Apply suggestions from code review (ericl)
- 56c9e8f comments (ericl)
- d7fb9b1 Merge branch 'autoscaling-doc' of github.com:ericl/ray into autoscali… (ericl)
- c3d93d4 Update doc/source/cluster/autoscaling.rst (ericl)
- f75a0b9 detail (ericl)
- c80a939 update (ericl)
- 269a4d8 update (ericl)
- e3d2494 update (ericl)
- 5cc0b83 update (ericl)
doc/source/cluster/autoscaling.rst (new file)

@@ -0,0 +1,116 @@
.. _ref-autoscaling:

Cluster Autoscaling
===================

Basics
------

The Ray Cluster Launcher will automatically enable a load-based autoscaler. When cluster resource usage exceeds a configurable threshold (80% by default), new nodes are launched up to the ``max_workers`` limit specified in the cluster config. When nodes are idle for more than a timeout, they are removed, down to the ``min_workers`` limit. The head node is never removed.

In more detail, the autoscaler implements the following control loop:

1. It calculates the estimated utilization of the cluster based on the most heavily utilized resource type (see the sketch after this list). For example, suppose a cluster has 100/200 CPUs assigned and 15/25 GPUs assigned; the utilization is then considered to be max(100/200, 15/25) = 60%.
2. If the estimated utilization is greater than the target (80% by default), the autoscaler attempts to add nodes to the cluster.
3. If a node is idle for a timeout (5 minutes by default), it is removed from the cluster.
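As a minimal sketch of step 1 (an illustrative helper, not part of the Ray API):

.. code::

    def estimated_utilization(assigned, total):
        """Max utilization across resource types, per step 1 above.

        assigned, total: dicts like {"CPU": 100, "GPU": 15} and
        {"CPU": 200, "GPU": 25}.
        """
        return max(
            assigned.get(r, 0) / total[r] for r in total if total[r] > 0)

    # max(100/200, 15/25) == 0.6, i.e. the 60% from the example above.
    assert estimated_utilization(
        {"CPU": 100, "GPU": 15}, {"CPU": 200, "GPU": 25}) == 0.6
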
The basic autoscaling config settings are as follows:

.. code::

    # A unique identifier for the head node and workers of this cluster.
    cluster_name: default

    # The minimum number of worker nodes to launch in addition to the head
    # node. This number should be >= 0.
    min_workers: 0

    # The autoscaler will scale up the cluster to this target fraction of
    # resource usage. For example, if a cluster of 10 nodes is 100% busy and
    # target_utilization is 0.8, it would resize the cluster to 13. This
    # fraction can be decreased to increase the aggressiveness of upscaling.
    # The max value allowed is 1.0, which is the most conservative setting.
    target_utilization_fraction: 0.8

    # If a node is idle for this many minutes, it will be removed. A node is
    # considered idle if there are no tasks or actors running on it.
    idle_timeout_minutes: 5

Multiple Node Type Autoscaling
------------------------------

Ray supports multiple node types in a single cluster. In this mode of operation, the scheduler looks at the queue of resource shape demands from the cluster (e.g., there might be 10 tasks queued, each requesting ``{"GPU": 4, "CPU": 16}``), and tries to add the minimum set of nodes that can fulfill these resource demands. This enables precise, rapid scale-up compared to looking only at resource utilization, as the autoscaler also has visibility into the aggregate resource load.

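Such a demand queue arises from ordinary task resource requests. For example, hypothetical application code like the following (``train_shard`` is illustrative, not from this PR) would enqueue ten ``{"GPU": 4, "CPU": 16}`` shapes:

.. code::

    import ray

    ray.init(address="auto")  # connect to the autoscaling cluster

    # Each queued invocation contributes a {"CPU": 16, "GPU": 4} resource
    # shape to the demand vector the autoscaler inspects.
    @ray.remote(num_cpus=16, num_gpus=4)
    def train_shard(shard_id):
        return shard_id

    # Ten queued tasks: the autoscaler tries to add the minimum set of
    # node types whose resources can pack ten such bundles.
    futures = [train_shard.remote(i) for i in range(10)]
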
The concept of a cluster node type encompasses both the physical instance type (e.g., AWS p3.8xl GPU nodes vs m4.16xl CPU nodes), as well as other attributes (e.g., IAM role, machine image, etc.). `Custom resources <configure.html>`__ can be specified for each node type so that Ray is aware of the demand for specific node types at the application level (e.g., a task may request to be placed on a machine with a specific role or machine image via a custom resource).

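For instance, using the ``is_spot`` custom resource from the example config below, a task could be steered to spot nodes. A sketch (the fractional request marks placement without consuming meaningful capacity):

.. code::

    import ray

    # Demands a sliver of the "is_spot" resource defined on the
    # cpu_16_spot node type below, so this task can only be placed on
    # (and will trigger scale-up of) spot nodes.
    @ray.remote(num_cpus=1, resources={"is_spot": 0.01})
    def preemptible_work():
        pass
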
Multi-node-type autoscaling operates in conjunction with the basic autoscaler. You may want to configure the basic autoscaler to act conservatively (i.e., set ``target_utilization_fraction: 1.0``).

An example of configuring multiple node types is as follows `(full example) <https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/aws/example-multi-node-type.yaml>`__:

.. code::

    # Specify the allowed node types and the resources they provide.
    # The key is the name of the node type, which is just for debugging purposes.
    # The node config specifies the launch config and physical instance type.
    available_node_types:
        cpu_4_ondemand:
            node_config:
                InstanceType: m4.xlarge
            resources: {"CPU": 4}
            max_workers: 5
        cpu_16_spot:
            node_config:
                InstanceType: m4.4xlarge
                InstanceMarketOptions:
                    MarketType: spot
            resources: {"CPU": 16, "Custom1": 1, "is_spot": 1}
            max_workers: 10
        gpu_1_ondemand:
            node_config:
                InstanceType: p2.xlarge
            resources: {"CPU": 4, "GPU": 1, "Custom2": 2}
            max_workers: 4
            worker_setup_commands:
                - pip install tensorflow-gpu  # Example command.
        gpu_8_ondemand:
            node_config:
                InstanceType: p2.8xlarge
            resources: {"CPU": 32, "GPU": 8}
            max_workers: 2
            worker_setup_commands:
                - pip install tensorflow-gpu  # Example command.

    # Specify the node type of the head node (as configured above).
    head_node_type: cpu_4_ondemand

    # Specify the default type of the worker node (as configured above).
    worker_default_node_type: cpu_16_spot

The above config defines two CPU node types (``cpu_4_ondemand`` and ``cpu_16_spot``) and two GPU types (``gpu_1_ondemand`` and ``gpu_8_ondemand``). Each node type has a name (e.g., ``cpu_4_ondemand``), which has no semantic meaning and is only for debugging. Let's look at the inner fields of the ``gpu_1_ondemand`` node type:

The node config tells the underlying cloud provider how to launch a node of this type. This node config is merged with the top-level node config of the YAML and can override fields (e.g., to specify the p2.xlarge instance type here):

.. code::

    node_config:
        InstanceType: p2.xlarge

The resources field tells the autoscaler what kinds of resources this node provides. This can include custom resources as well (e.g., "Custom2"). This field enables the autoscaler to automatically select the right kind of nodes to launch given the resource demands of the application. The resources specified here will be automatically passed to the ``ray start`` command for the node via an environment variable. For more information, see also the `resource demand scheduler <https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/resource_demand_scheduler.py>`__:

.. code::

    resources: {"CPU": 4, "GPU": 1, "Custom2": 2}

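Once such a node joins the cluster, its resources appear in the aggregate totals, which can be checked from a driver (a sketch using the standard Ray API):

.. code::

    import ray

    ray.init(address="auto")

    # After a gpu_1_ondemand node joins, the aggregate includes its
    # resources, e.g. GPU and Custom2 alongside CPU.
    print(ray.cluster_resources())
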
The ``max_workers`` field constrains the number of nodes of this type that can be launched:

.. code::

    max_workers: 4

The ``worker_setup_commands`` field can be used to override the setup and initialization commands for a node type. Note that you can only override the setup for worker nodes. The head node's setup commands are always configured via the top-level field in the cluster YAML:

.. code::

    worker_setup_commands:
        - pip install tensorflow-gpu  # Example command.

Changed file: example cluster config YAML

@@ -42,7 +42,7 @@ docker:
     # usage. For example, if a cluster of 10 nodes is 100% busy and
     # target_utilization is 0.8, it would resize the cluster to 13. This fraction
     # can be decreased to increase the aggressiveness of upscaling.
-    # This value must be less than 1.0 for scaling to happen.
+    # The max value allowed is 1.0, which is the most conservative setting.
     target_utilization_fraction: 0.8

     # If a node is idle for this many minutes, it will be removed.

A reviewer left a suggested change on the new comment line.

python/ray/autoscaler/resource_demand_scheduler.py

@@ -1,3 +1,12 @@
"""Implements multi-node-type autoscaling.

This file implements an autoscaling algorithm that is aware of multiple node
types (e.g., example-multi-node-type.yaml). The Ray autoscaler will pass in
a vector of resource shape demands, and the resource demand scheduler will
return a list of node types that can satisfy the demands given constraints
(i.e., reverse bin packing).
"""

import copy
import numpy as np
import logging

@@ -30,18 +39,43 @@ def __init__(self, provider: NodeProvider,

        self.node_types = node_types
        self.max_workers = max_workers

    # TODO(ekl) take into account existing utilization of node resources. We
    # should subtract these from node resources prior to running bin packing.
    def get_nodes_to_launch(self, nodes: List[NodeID],
                            pending_nodes: Dict[NodeType, int],
                            resource_demands: List[ResourceDict]
                            ) -> List[Tuple[NodeType, int]]:
        """Given resource demands, return node types to add to the cluster.

        This method:
            (1) calculates the resources present in the cluster.
            (2) calculates the unfulfilled resource bundles.
            (3) calculates which nodes need to be launched to fulfill all
                the bundle requests, subject to max_worker constraints.

        Args:
            nodes: List of existing nodes in the cluster.
            pending_nodes: Summary of node types currently being launched.
            resource_demands: Vector of resource demands from the scheduler.
        """

        if resource_demands is None:
            logger.info("No resource demands")
            return []

        node_resources, node_type_counts = self.calculate_node_resources(
            nodes, pending_nodes)
        logger.info("Cluster resources: {}".format(node_resources))
        logger.info("Node counts: {}".format(node_type_counts))

        unfulfilled = get_bin_pack_residual(node_resources, resource_demands)
        logger.info("Resource demands: {}".format(resource_demands))
        logger.info("Unfulfilled demands: {}".format(unfulfilled))

        nodes = get_nodes_for(self.node_types, node_type_counts,
                              self.max_workers - len(nodes), unfulfilled)
        logger.info("Node requests: {}".format(nodes))
        return nodes

    def calculate_node_resources(
            self, nodes: List[NodeID], pending_nodes: Dict[NodeID, int]

ericl (Author), on the TODO comment above: cc @wuisawesome

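To make the inputs and outputs of ``get_nodes_to_launch`` concrete, here is a hypothetical call (node IDs, pending counts, and the result are invented for illustration; ``scheduler`` is assumed to be a ``ResourceDemandScheduler`` configured with the node types from the example YAML above):

    to_launch = scheduler.get_nodes_to_launch(
        nodes=["node-1", "node-2"],                # existing node IDs
        pending_nodes={"cpu_16_spot": 1},          # launches in flight
        resource_demands=[{"GPU": 4, "CPU": 16}] * 10)
    # Each p2.8xlarge ({"CPU": 32, "GPU": 8}) packs two such bundles, so
    # something like [("gpu_8_ondemand", 2)] could be returned, capped by
    # that node type's max_workers.
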
@@ -73,36 +107,18 @@ def add_node(node_type):

        return node_resources, node_type_counts

    def debug_string(self, nodes: List[NodeID],
                     pending_nodes: Dict[NodeID, int]) -> str:
        node_resources, node_type_counts = self.calculate_node_resources(
            nodes, pending_nodes)

        out = "Worker node types:"
        for node_type, count in node_type_counts.items():
            out += "\n - {}: {}".format(node_type, count)
            if pending_nodes.get(node_type):
                out += " ({} pending)".format(pending_nodes[node_type])

        return out


def get_nodes_for(node_types: Dict[NodeType, NodeTypeConfigDict],

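To make the "reverse bin packing" step concrete, here is a minimal, self-contained sketch of what ``get_bin_pack_residual`` conceptually computes (a first-fit illustration, not the actual implementation):

    import copy
    from typing import Dict, List

    ResourceDict = Dict[str, float]

    def bin_pack_residual(node_resources: List[ResourceDict],
                          resource_demands: List[ResourceDict]
                          ) -> List[ResourceDict]:
        """Return the demands that cannot fit on the given nodes."""
        nodes = copy.deepcopy(node_resources)
        unfulfilled = []
        for demand in resource_demands:
            for node in nodes:
                if all(node.get(k, 0) >= v for k, v in demand.items()):
                    for k, v in demand.items():
                        node[k] -= v  # consume capacity on this node
                    break
            else:
                unfulfilled.append(demand)  # no node can hold this demand
        return unfulfilled

    # One 4-CPU node packs two {"CPU": 2} demands; the third is the
    # residual the autoscaler must launch new nodes for.
    assert bin_pack_residual([{"CPU": 4}], [{"CPU": 2}] * 3) == [{"CPU": 2}]
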
Review discussion:

- Reviewer: "Not sure if this is accurate."
- Reviewer: "It's actually resource usage, not CPU."
- Reviewer: "Could we specify what resources are considered?"
- ericl (Author): "I've updated this in the paragraph above."