
Cluster Scaling


Automated scaling of Nomad cluster nodes is a key feature of Replicator and relies on meta configuration parameters within the Nomad agent configuration. Replicator implements cluster scaling using the concept of worker pools, which are simply logical groupings of worker nodes.

Vocabulary

  • Worker Pool: A collection of Nomad cluster nodes that share similar characteristics and are treated as a logical grouping for the purposes of cluster scaling. Replicator performs scaling operations against worker pools; nodes are assigned to a worker pool by setting the replicator_worker_pool configuration directive.

    Presently, Replicator expects a worker pool to map directly to an AWS autoscaling group. Future releases will add support for additional cloud providers and additional mechanisms for logically grouping nodes.

Cluster Scaling Configuration

To configure Nomad nodes as eligible for discovery and autoscaling by Replicator, you simply add a handful of meta configuration parameters to the Nomad agent configuration. Replicator constantly watches the Nomad API to automatically discover cluster nodes and detect changes in node configuration or status.

When the node discovery engine detects new nodes that have the necessary Replicator configuration parameters, they are added to the Node Registry, an internal database of worker pools and their associated nodes. Replicator uses a custom algorithm to constantly evaluate each worker pool and determine whether it should be scaled up or down, meaning you are not responsible for setting hard-coded scaling policies such as CPU or memory limits.
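
For illustration, the following is a minimal sketch of how Replicator-enabled nodes could be discovered through the Nomad Go API (the public github.com/hashicorp/nomad/api package) and grouped by worker pool. It is an approximation of the discovery behaviour described above, not Replicator's actual implementation.

package main

import (
    "fmt"
    "log"

    nomad "github.com/hashicorp/nomad/api"
)

func main() {
    // Connect to the local Nomad agent using default settings (NOMAD_ADDR, etc.).
    client, err := nomad.NewClient(nomad.DefaultConfig())
    if err != nil {
        log.Fatal(err)
    }

    // List all cluster nodes, then inspect each node's meta parameters.
    stubs, _, err := client.Nodes().List(nil)
    if err != nil {
        log.Fatal(err)
    }

    // pools is a simple stand-in for Replicator's internal Node Registry:
    // worker pool name -> node IDs.
    pools := make(map[string][]string)

    for _, stub := range stubs {
        node, _, err := client.Nodes().Info(stub.ID, nil)
        if err != nil {
            continue
        }
        // Only nodes that opt in via replicator_enabled and declare a worker
        // pool are considered eligible for cluster scaling.
        if node.Meta["replicator_enabled"] == "true" && node.Meta["replicator_worker_pool"] != "" {
            pool := node.Meta["replicator_worker_pool"]
            pools[pool] = append(pools[pool], node.ID)
        }
    }

    for pool, nodes := range pools {
        fmt.Printf("worker pool %q: %d node(s)\n", pool, len(nodes))
    }
}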

Cluster Scaling Parameters

The meta configuration parameters that Replicator uses to control cluster scaling can be found below. Some configuration parameters have sensible default values while others are required.

Required Parameters

  • replicator_enabled (bool): Determines whether or not Replicator should perform scaling operations against this node and its associated worker pool.

  • replicator_max (int): The maximum number of nodes permitted in a worker pool.

  • replicator_min (int): The minimum number of nodes permitted in a worker pool.

  • replicator_notification_uid (string): A unique identifier that will be used to distinguish alerts about this worker pool.

  • replicator_region (string): The cloud region in which the worker pool is running.

  • replicator_worker_pool (string): The name of the worker pool to which the node belongs. Each node within a given autoscaling context should be assigned the same worker pool name.

Optional Parameters

  • replicator_cooldown (int): The waiting period, in seconds, that Replicator will enforce between cluster scaling operations. Defaults to 300 seconds.

  • replicator_node_fault_tolerance (int): The number of nodes within a worker pool that should be reserved for fault-tolerance when evaluating resource utilization across the worker pool. Defaults to 1 reserved node.

  • replicator_retry_threshold (int): The maximum number of times Replicator is permitted to retry a failed scaling event before the worker pool is placed in failsafe mode. Defaults to 3 retry attempts.

  • replicator_scaling_threshold (int): The number of consecutive times Replicator must detect a scaling operation is required before the scaling event will be permitted. Defaults to 3 consecutive scaling requests.

  • replicator_scale_factor (int): The number of nodes in the worker pool to be added when a scale out operation is taking place. Defaults to 1.

Example Configuration

Replicator takes advantage of the ability to add custom meta configuration parameters within the Nomad agent configuration; on a client node the meta block typically lives inside the client stanza. Each node within a worker pool should have an identical Replicator configuration.

meta {
  "replicator_cooldown"             = 300
  "replicator_enabled"              = true
  "replicator_max"                  = 10
  "replicator_min"                  = 5
  "replicator_node_fault_tolerance" = 1
  "replicator_notification_uid"     = "REP2"
  "replicator_region"               = "us-east-1"
  "replicator_retry_threshold"      = 3
  "replicator_scaling_threshold"    = 3
  "replicator_scale_factor"         = 1
  "replicator_worker_pool"          = "nomad-nodes-public-prod"
}

You can also use the replicator init -cluster-scaling command to generate a sample Replicator configuration that you can edit.

Cluster Scaling Considerations

Replicator supports scaling a worker pool up or down based on utilization and runtime constraints. While scaling a worker pool up is a fairly straightforward process, special considerations are required when scaling a worker pool down.

When Replicator determines a worker pool requires a scale down event, it first identifies the least-allocated node in the worker pool. Once an eligible node has been identified, it is placed in drain mode and monitored to ensure all allocations have been migrated off the node.

At this point, the remainder of the scaling operation is straightforward: the instance is detached from the autoscaling group and terminated.
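
As an illustration only, the detach-and-terminate step might look roughly like the sketch below, written against the AWS SDK for Go (v1). The instance ID and autoscaling group name are placeholders, the node is assumed to have already been drained via the Nomad API, and this is not Replicator's actual implementation.

package main

import (
    "log"

    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/autoscaling"
    "github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
    // Placeholder values for illustration only.
    instanceID := "i-0123456789abcdef0"
    asgName := "nomad-nodes-public-prod"

    sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("us-east-1")}))

    // Detach the drained instance from the autoscaling group, decrementing the
    // desired capacity so the group does not immediately launch a replacement.
    _, err := autoscaling.New(sess).DetachInstances(&autoscaling.DetachInstancesInput{
        AutoScalingGroupName:           aws.String(asgName),
        InstanceIds:                    []*string{aws.String(instanceID)},
        ShouldDecrementDesiredCapacity: aws.Bool(true),
    })
    if err != nil {
        log.Fatalf("detach failed: %v", err)
    }

    // Terminate the now-detached instance.
    _, err = ec2.New(sess).TerminateInstances(&ec2.TerminateInstancesInput{
        InstanceIds: []*string{aws.String(instanceID)},
    })
    if err != nil {
        log.Fatalf("terminate failed: %v", err)
    }
}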

Cluster Scaling Failures

Replicator verifies each step of a cluster scaling event and, where possible, retries any step in the process that has failed.

If enough consecutive failures are encountered during a scaling event to reach the replicator_retry_threshold, Replicator will place the worker pool in failsafe mode and trigger a notification to any configured notification backends. While a worker pool is in failsafe mode, Replicator will refuse to take any scaling action against it. This protection is stored persistently in the state tracking object and will therefore survive a restart of Replicator or a leadership change to another running copy of Replicator.

Once an operator has confirmed the issue causing the failures has been resolved, the failsafe protection should be removed using the failsafe command.

Cluster Scaling Algorithm

Replicator seeks to answer a fundamental question: can a worker pool continue to support its workload within defined operational constraints? Replicator needs only a single piece of information from an operator, replicator_node_fault_tolerance, to make scaling decisions for a worker pool.

The cluster scaling process takes the following steps during each evaluation cycle for each worker pool:

  1. Calculates the total capacity of the worker pool by computing the sum of all resources on each node in the worker pool (e.g. CPU, Memory and Disk).

  2. Calculates the total consumed capacity of the worker pool by computing the sum of the used capacity of each allocation assigned to all nodes in the worker pool.

  3. Determines which jobs running on the worker pool are configured for scaling with Replicator and computes the sum of resources allocated to those jobs. Replicator reserves these resources as a scaling overhead to ensure the worker pool always retains enough capacity to scale each of those jobs up by one.

  4. Dynamically determines the prioritized scaling metric, that is, the resource in the worker pool that is most heavily utilized.

  5. Computes the maximum allowed utilization of the prioritized scaling metric.

    The algorithm first determines the average node allocation: average_node_allocation = total_pool_capacity / total_pool_nodes.

    The maximum allowed utilization is then calculated as: maximum_allowed_utilization = (total_pool_capacity - scaling_reserve) - (average_node_allocation * fault_tolerant_node_count).

  6. If the worker pool utilization of the prioritized scaling metric is below the maximum allowed utilization, the worker pool is marked for a potential scale down event. If the worker pool utilization is above the maximum allowed utilization, the worker pool is marked for a potential scale up event.
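
To make the arithmetic in steps 4 through 6 concrete, the sketch below works through the calculation in Go for a hypothetical six node worker pool. The capacity, consumption and reserve figures are invented for the example; the code simply mirrors the formulas above and is not Replicator's actual implementation.

package main

import "fmt"

func main() {
    // Illustrative figures only: a 6-node worker pool, per-node capacity of
    // 4000 MHz CPU and 16000 MB memory, with a fault tolerance of 1 node.
    totalPoolNodes := 6.0
    faultTolerantNodeCount := 1.0

    // Step 1: total capacity of the pool for each resource.
    totalCPU := totalPoolNodes * 4000.0
    totalMem := totalPoolNodes * 16000.0

    // Step 2: capacity consumed by all allocations in the pool (made up).
    usedCPU := 14000.0
    usedMem := 70000.0

    // Step 3: resources reserved as scaling overhead for Replicator-scaled jobs (made up).
    reserveCPU := 2000.0
    reserveMem := 8000.0

    // Step 4: the prioritized scaling metric is the most heavily utilized resource.
    cpuUtil := usedCPU / totalCPU
    memUtil := usedMem / totalMem
    totalCapacity, used, reserve := totalCPU, usedCPU, reserveCPU
    metric := "cpu"
    if memUtil > cpuUtil {
        totalCapacity, used, reserve = totalMem, usedMem, reserveMem
        metric = "memory"
    }

    // Step 5: maximum allowed utilization of the prioritized scaling metric.
    averageNodeAllocation := totalCapacity / totalPoolNodes
    maxAllowed := (totalCapacity - reserve) - (averageNodeAllocation * faultTolerantNodeCount)

    // Step 6: compare current utilization against the maximum allowed value.
    fmt.Printf("prioritized metric: %s\n", metric)
    if used > maxAllowed {
        fmt.Println("potential scale up: utilization exceeds the maximum allowed utilization")
    } else {
        fmt.Println("potential scale down: utilization is below the maximum allowed utilization")
    }
}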

Safety Checks

Before Replicator will trigger a scaling event, a series of safety checks must first pass. If any of the safety checks fail, the scaling event is denied and Replicator re-evaluates the worker pool at the next cycle. A combined sketch of several of these checks follows the list below.

  • AWS Auto Scaling Limits: Replicator will attempt to retrieve the current configuration of the worker pool autoscaling group and determine if the scaling event would violate the min/max thresholds defined on the autoscaling group. If either threshold would be violated, the safety check fails. If Replicator is unable to retrieve the ASG configuration, the safety check fails.

  • Replicator Thresholds: Replicator will check the min/max thresholds defined by the Replicator configuration for the worker pool. If the scaling event would violate the min/max thresholds defined on the worker pool within Replicator, the safety check fails.

  • Maximum Allowed Utilization Threshold: During a scale down event, Replicator first simulates what the worker pool would look like after the removal of a node. If the simulation shows the new resource utilization would exceed the maximum allowed utilization or would be within 10% of the maximum utilization, the safety check fails. This ensures Replicator never allows a scale down event that would require an immediate scale up event.

  • Scaling Cooldown Threshold: Replicator will ensure that the required amount of time has passed since the last scaling event was triggered for the worker pool. If the cooldown threshold has not been met, the safety check fails.

  • Scaling Threshold: Each time Replicator evaluates a worker pool and determines a scaling event is required, the scaling event request is recorded in the state tracking object. Replicator checks whether the worker pool has accumulated the required number of consecutive scaling requests. If the consecutive scaling request count has not been met, the safety check fails.
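
As a rough illustration of how several of these checks compose, the sketch below evaluates the Replicator thresholds, the scaling cooldown and the consecutive scaling request count for a hypothetical worker pool. The struct fields and values are invented for the example and do not reflect Replicator's internal types.

package main

import (
    "fmt"
    "time"
)

// workerPoolState is a hypothetical snapshot of a worker pool's scaling state.
type workerPoolState struct {
    CurrentNodes        int
    Min, Max            int           // replicator_min / replicator_max
    Cooldown            time.Duration // replicator_cooldown
    LastScalingEvent    time.Time
    ConsecutiveRequests int // consecutive evaluations requesting this event
    ScalingThreshold    int // replicator_scaling_threshold
}

// safetyChecksPass returns false if any of the illustrated checks would deny
// the event. delta is +1 for a scale up event and -1 for a scale down event.
func safetyChecksPass(p workerPoolState, delta int) bool {
    // Replicator thresholds: the event must not violate replicator_min/replicator_max.
    target := p.CurrentNodes + delta
    if target < p.Min || target > p.Max {
        return false
    }

    // Scaling cooldown threshold: enough time must have passed since the last event.
    if time.Since(p.LastScalingEvent) < p.Cooldown {
        return false
    }

    // Scaling threshold: the event must have been requested on enough consecutive evaluations.
    if p.ConsecutiveRequests < p.ScalingThreshold {
        return false
    }

    // The AWS autoscaling group limits and the scale down utilization
    // simulation would also need to pass before the event is permitted.
    return true
}

func main() {
    pool := workerPoolState{
        CurrentNodes:        6,
        Min:                 5,
        Max:                 10,
        Cooldown:            300 * time.Second,
        LastScalingEvent:    time.Now().Add(-10 * time.Minute),
        ConsecutiveRequests: 3,
        ScalingThreshold:    3,
    }
    fmt.Println("scale down permitted:", safetyChecksPass(pool, -1))
}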