This repository has been archived by the owner on Jan 31, 2019. It is now read-only.

Job Scaling

Rampal Chopra edited this page Nov 21, 2018 · 1 revision

Autoscaling Nomad jobs is a key part of Replicator's functionality, and jobs are configured using meta parameters. In a Nomad job, the count is specified at the group level, so Replicator supports scaling each job group independently.

Job Scaling Configuration

Replicator's job scaling behavior and limits are similar to AWS Auto Scaling and should be provided for each job group you wish to dynamically scale. Replicator currently scales a job group by a count of 1 per scaling activity, whether it is a scale-out or a scale-in event. A number of factors contributed to this design decision, chiefly the speed at which Nomad, Replicator and Docker can perform the scaling actions. If you believe this increment/decrement count should be configurable, we are certainly open to discussing adding this functionality.

Job Scaling Parameters

The parameters that Replicator uses to configure job scaling can be seen below. Currently only replicator_cooldown has a configured default and therefore does not need to be explicitly set; this is likely to change in the future.

  • replicator_cooldown (int: 60) The time period, in seconds, that Replicator must wait between scaling activities.

  • replicator_enabled (bool) This dictates whether scaling for the job group is enabled or disabled. This is helpful for initial testing and tuning of scaling documents and jobs, and allows scaling of particular job groups to be turned on or off.

  • replicator_max (int) The maximum count for the job group, which must not be violated.

  • replicator_min (int) The minimum count for the job group, which must not be violated.

  • replicator_scalein_mem (int) The MEM threshold in percentage which, if violated, will cause Replicator to scale the group in by 1.

  • replicator_scalein_cpu (int) The CPU threshold in percentage which, if violated, will cause Replicator to scale the group in by 1.

  • replicator_scaleout_mem (int) The MEM threshold in percentage which, if violated, will cause Replicator to scale the group out by 1.

  • replicator_scaleout_cpu (int) The CPU threshold in percentage which, if violated, will cause Replicator to scale the group out by 1.

  • replicator_notification_uid (string) When a notification is sent about this job group, this UID is used to distinguish the alert.

Job Scaling Meta Configuration

To keep a job completely self-contained, the meta parameters for scaling are held in the job specification. Each group that requires autoscaling should have a meta stanza defined which contains the required Replicator parameters. An example scaling document which would work with the Nomad example job would look like:

meta {
  "replicator_cooldown"         = 30
  "replicator_enabled"          = true
  "replicator_max"              = 10
  "replicator_min"              = 2
  "replicator_scalein_mem"      = 30
  "replicator_scalein_cpu"      = 30
  "replicator_scaleout_mem"     = 80
  "replicator_scaleout_cpu"     = 80
  "replicator_notification_uid" = "REP1"
}
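For context, the meta stanza above is defined inside the group it scales. A minimal sketch of how it might sit within the Nomad example job follows; the job, group, and task names and the Docker image here are illustrative assumptions, not part of the original example:

```hcl
# Hypothetical Nomad job specification showing where the Replicator
# meta stanza lives relative to the group-level count.
job "example" {
  datacenters = ["dc1"]

  group "cache" {
    # Starting count; Replicator adjusts this between
    # replicator_min and replicator_max.
    count = 2

    # Replicator reads these meta parameters to scale this group.
    meta {
      "replicator_cooldown"         = 30
      "replicator_enabled"          = true
      "replicator_max"              = 10
      "replicator_min"              = 2
      "replicator_scalein_mem"      = 30
      "replicator_scalein_cpu"      = 30
      "replicator_scaleout_mem"     = 80
      "replicator_scaleout_cpu"     = 80
      "replicator_notification_uid" = "REP1"
    }

    task "redis" {
      driver = "docker"
      config {
        image = "redis:3.2"
      }
    }
  }
}
```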

It is also possible to use the replicator init -job-scaling command to write out an example policy, which can then be modified to suit your needs.

Job Scaling Failures

Replicator fully confirms the success of a scaling action using Nomad deployments and a watcher process. In the event of a failure, the job group is placed into failsafe mode and a notification is sent to any configured notification backends, alerting operators to the issue. Once an operator has confirmed the issue has been resolved, the failsafe lock should be removed using Replicator's failsafe command. Once the lock is removed, scaling will be re-enabled and resume normal behavior.