Enable Tasks to specify their own custom maintenance SLA.
`Tasks` can specify custom SLA requirements as part of
their `TaskConfig`. One of the new features is the ability
to specify an external coordinator that can ACK/NACK
maintenance requests for tasks. This will be hugely
beneficial for onboarding services that cannot satisfactorily
specify SLA in terms of running instances.

Maintenance requests are driven from the Scheduler to
improve management of nodes in the cluster.

Testing Done:
./build-support/jenkins/build.sh
./src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh

Bugs closed: AURORA-1978

Reviewed at https://reviews.apache.org/r/66716/
shanmugh committed Jun 5, 2018
1 parent 34be631 commit f2acf53
Showing 49 changed files with 3,550 additions and 136 deletions.
32 changes: 32 additions & 0 deletions RELEASE-NOTES.md
@@ -1,3 +1,35 @@
0.21.0
======

### New/updated:
- Introduce ability for tasks to specify custom SLA requirements via the new `SlaPolicy` structs.
There are 3 different SLA Policies that are currently supported - `CountSlaPolicy`,
`PercentageSlaPolicy` and `CoordinatorSlaPolicy`. SLA policies based on count and percentage
express the required number of `RUNNING` instances as either a count or a percentage, along with
the duration window for which these requirements have to be satisfied. For applications
that need more control over how SLA is determined, a custom SLA calculator, known as the
Coordinator, can be configured. Any action that can affect the SLA will first check with the
Coordinator before proceeding.

**IMPORTANT: The storage changes required for this feature will make the scheduler
snapshot backwards-incompatible. The scheduler will be unable to read the snapshot if rolled back to
a previous version. If a rollback is absolutely necessary, perform the following steps:**
1. Stop all host maintenance requests via `aurora_admin host_activate`.
2. Ensure a new snapshot is created by running `aurora_admin scheduler_snapshot <cluster>`.
3. Roll back to the previous version.

Note: The `Coordinator` interface required for the `CoordinatorSlaPolicy` is experimental at
this juncture and is bound to change in the future.

### Deprecations and removals:

- Deprecated the `aurora_admin host_drain` command used for maintenance. With this release, the SLA
computations are moved to the scheduler, so the client no longer needs to compute
SLAs and watch the drains. The scheduler persists any host maintenance request and performs an
SLA-aware drain of the tasks before marking the host as `DRAINED`, so maintenance requests
survive scheduler fail-overs. Use the newly introduced `aurora_admin sla_host_drain`
to skip the SLA computations on the admin client.

0.20.0
======

6 changes: 6 additions & 0 deletions api/src/main/thrift/org/apache/aurora/gen/api.thrift
@@ -1244,6 +1244,12 @@ service AuroraAdmin extends AuroraSchedulerManager {
/** Set the given hosts back into serving mode. */
Response endMaintenance(1: Hosts hosts)

/**
* Ask the scheduler to put hosts into DRAINING mode and move scheduled tasks off of the hosts
* such that their SLA requirements are satisfied. Uses defaultSlaPolicy for tasks that do not set one.
**/
Response slaDrainHosts(1: Hosts hosts, 2: SlaPolicy defaultSlaPolicy, 3: i64 timeoutSecs)

/** Start a storage snapshot and block until it completes. */
Response snapshot()

1 change: 1 addition & 0 deletions docs/README.md
Expand Up @@ -28,6 +28,7 @@ Description of important Aurora features.
* [Services](features/services.md)
* [Service Discovery](features/service-discovery.md)
* [SLA Metrics](features/sla-metrics.md)
* [SLA Requirements](features/sla-requirements.md)
* [Webhooks](features/webhooks.md)

## Operators
181 changes: 181 additions & 0 deletions docs/features/sla-requirements.md
@@ -0,0 +1,181 @@
SLA Requirements
================

- [Overview](#overview)
- [Default SLA](#default-sla)
- [Custom SLA](#custom-sla)
- [Count-based](#count-based)
- [Percentage-based](#percentage-based)
- [Coordinator-based](#coordinator-based)

## Overview

Aurora guarantees SLA requirements for jobs. These requirements limit the impact of cluster-wide
maintenance operations on the jobs. For instance, when an operator upgrades
the OS on all the Mesos agent machines, the tasks scheduled on them need to be drained.
By specifying its SLA requirements, a job can make sure that it has enough instances to
continue operating safely without incurring downtime.

> SLA is defined as the minimum number of active tasks required for a job over every duration window.
A task is active if it was in the `RUNNING` state during the last duration window.

There is a [default](#default-sla) SLA guarantee for
[preferred](../features/multitenancy.md#configuration-tiers) tier jobs and it is also possible to
specify [custom](#custom-sla) SLA requirements.

## Default SLA

Aurora guarantees a default SLA requirement for tasks in the
[preferred](../features/multitenancy.md#configuration-tiers) tier.

> 95% of tasks in a job will be `active` over every 30-minute window.

## Custom SLA

For jobs that need stricter or different guarantees, Aurora allows jobs to specify their own
SLA requirements via an `SlaPolicy`. There are 3 different ways to express SLA requirements.

### [Count-based](../reference/configuration.md#countslapolicy-objects)

For jobs that need a minimum `number` of instances to be running all the time,
[`CountSlaPolicy`](../reference/configuration.md#countslapolicy-objects)
provides the ability to express the minimum number of required active instances (i.e. number of
tasks that are `RUNNING` for at least `duration_secs`). For instance, if we have a
`replicated-service` with 3 instances that needs at least 2 of them to be active over every
30-minute window to be considered healthy, the SLA requirement can be expressed with a
[`CountSlaPolicy`](../reference/configuration.md#countslapolicy-objects) like below:

```python
Job(
  name = 'replicated-service',
  role = 'www-data',
  instances = 3,
  sla_policy = CountSlaPolicy(
    count = 2,
    duration_secs = 1800
  )
  ...
)
```

### [Percentage-based](../reference/configuration.md#percentageslapolicy-objects)

For jobs that need a minimum `percentage` of instances to be running all the time,
[`PercentageSlaPolicy`](../reference/configuration.md#percentageslapolicy-objects) provides the
ability to express the minimum percentage of required active instances (i.e. percentage of tasks
that are `RUNNING` for at least `duration_secs`). For instance, if we have a `webservice` that
has 10000 instances for handling peak load and cannot have more than 0.1% of its instances down
over any 1-hour window, the SLA requirement can be expressed with a
[`PercentageSlaPolicy`](../reference/configuration.md#percentageslapolicy-objects) like below:

```python
Job(
  name = 'frontend-service',
  role = 'www-data',
  instances = 10000,
  sla_policy = PercentageSlaPolicy(
    percentage = 99.9,
    duration_secs = 3600
  )
  ...
)
```

### [Coordinator-based](../reference/configuration.md#coordinatorslapolicy-objects)

When neither of the above methods is enough to describe the SLA requirements for a job, the SLA
calculation can be off-loaded to a custom service called the `Coordinator`. The `Coordinator` needs
to expose an endpoint that will be called to check if the removal of a task will affect the SLA
requirements for the job. This is useful for controlling the number of tasks that undergo
maintenance at a time, without affecting the SLA of the application.

Consider an example where we have a `storage-service` that stores 2 replicas of each object. The
replicas are distributed across the instances such that they are stored on different hosts. In
addition, a consistent hash is used for distributing the data across the instances.

When an instance needs to be drained (say for host maintenance), we have to make sure that at least 1 of
the 2 replicas remains available. In such a case, a `Coordinator` service can be used to maintain
the SLA guarantees required for the job.

The job can be configured with a
[`CoordinatorSlaPolicy`](../reference/configuration.md#coordinatorslapolicy-objects) to specify the
coordinator endpoint and the field in the response JSON that indicates whether the SLA will be
affected when the task is removed.

```python
Job(
  name = 'storage-service',
  role = 'www-data',
  sla_policy = CoordinatorSlaPolicy(
    coordinator_url = 'http://coordinator.example.com',
    status_key = 'drain'
  )
  ...
)
```


#### Coordinator Interface [Experimental]

When a [`CoordinatorSlaPolicy`](../reference/configuration.md#coordinatorslapolicy-objects) is
specified for a job, any action that requires removing a task
(such as drains) must get approval from the `Coordinator` before proceeding. The
Coordinator service needs to expose an HTTP endpoint that takes a `task-key` param
(`<cluster>/<role>/<env>/<name>/<instance>`) and a JSON body describing the task
details, and returns a JSON response containing a boolean status that allows or disallows
the task's removal.

##### Request:
```javascript
POST /
?task=<cluster>/<role>/<env>/<name>/<instance>

{
  "assignedTask": {
    "taskId": "taskA",
    "slaveHost": "a",
    "task": {
      "job": {
        "role": "role",
        "environment": "devel",
        "name": "job"
      },
      ...
    },
    "assignedPorts": {
      "http": 1000
    },
    "instanceId": 1
    ...
  },
  ...
}
```
##### Response:
```json
{
  "drain": true
}
```
If the Coordinator allows removal of the task, the task's
[termination lifecycle](../reference/configuration.md#httplifecycleconfig-objects)
is triggered. If the Coordinator does not allow removal, the request will be retried later.

#### Coordinator Actions

Each Coordinator endpoint gets its own lock, which is used to serialize calls to that Coordinator.
This guarantees that only one request is in flight to a coordinator endpoint at a time, which allows
coordinators to simply look at the current state of the tasks to determine the SLA (without having
to worry about in-flight and pending requests). If there are multiple coordinators,
maintenance can still proceed in parallel across them.

_Note: A single concurrent request to a coordinator endpoint does not translate into an exactly-once
guarantee. The coordinator must be able to handle duplicate drain
requests for the same task._
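
##### Example Coordinator (sketch)

The following is a minimal sketch of a hypothetical Coordinator service (it is not part of Aurora;
the handler name, port and replica-tracking logic are illustrative assumptions). It implements the
request/response contract above and approves a drain only when the job has no other drain already
in flight, relying on the single-concurrent-request guarantee described under Coordinator Actions:

```python
# Minimal sketch of a hypothetical Coordinator (illustration only; names, port and
# the replica-tracking logic are assumptions, not part of Aurora).
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

draining_jobs = set()  # job keys that currently have a task being drained

class CoordinatorHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # The task key arrives as ?task=<cluster>/<role>/<env>/<name>/<instance>.
        task_key = parse_qs(urlparse(self.path).query).get('task', [''])[0]
        job_key = '/'.join(task_key.split('/')[:4])  # cluster/role/env/name

        # The request body carries the assigned-task details as JSON; a real
        # Coordinator could inspect it (e.g. slaveHost) to make a smarter decision.
        body_len = int(self.headers.get('Content-Length', 0))
        task_details = json.loads(self.rfile.read(body_len) or b'{}')

        # Approve the drain only if this job has no other drain in flight.
        allow = job_key not in draining_jobs
        if allow:
            draining_jobs.add(job_key)  # a real service would clear this later

        # The response key must match the job's status_key ('drain' by default).
        response = json.dumps({'drain': allow}).encode()
        self.send_response(200)
        self.send_header('Content-Type', 'application/json')
        self.send_header('Content-Length', str(len(response)))
        self.end_headers()
        self.wfile.write(response)

if __name__ == '__main__':
    HTTPServer(('0.0.0.0', 8080), CoordinatorHandler).serve_forever()
```

A real Coordinator would also clear the job entry once the drained replica is healthy again and
would consult its own replication state before answering.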
55 changes: 48 additions & 7 deletions docs/operations/configuration.md
@@ -312,19 +312,19 @@ increased).

To enable this in the Scheduler, you can set the following options:

--enable_update_affinity=true
--update_affinity_reservation_hold_time=3mins
-enable_update_affinity=true
-update_affinity_reservation_hold_time=3mins

You will need to tune the hold time to match the behavior you see in your cluster. If you have extremely
high update throughput, you might have to extend it as processing updates could easily add significant
delays between scheduling attempts. You may also have to tune scheduling parameters to achieve the
throughput you need in your cluster. Some relevant settings (with defaults) are:

--max_schedule_attempts_per_sec=40
--initial_schedule_penalty=1secs
--max_schedule_penalty=1mins
--scheduling_max_batch_size=3
--max_tasks_per_schedule_attempt=5
-max_schedule_attempts_per_sec=40
-initial_schedule_penalty=1secs
-max_schedule_penalty=1mins
-scheduling_max_batch_size=3
-max_tasks_per_schedule_attempt=5

There are metrics exposed by the Scheduler which can provide guidance on where the bottleneck is.
Example metrics to look at:
@@ -337,3 +337,44 @@ Example metrics to look at:
Most likely you'll run into limits with the number of update instances that can be processed per minute
before you run into any other limits. So if your total work done per minute starts to exceed 2k instances,
you may need to extend the update_affinity_reservation_hold_time.

## Cluster Maintenance

Aurora performs maintenance-related task drains. How often the scheduler polls for maintenance
work can be controlled via,

-host_maintenance_polling_interval=1min

## Enforcing SLA limitations

Since tasks can specify their own `SlaPolicy`, the cluster needs to limit these SLA requirements.
An overly aggressive requirement can permanently block any type of maintenance work
(e.g. OS/kernel/security upgrades) on a host and hold it hostage.

An operator can control the limits for SLA requirements via these scheduler configuration options:

-max_sla_duration_secs=2hrs
-min_required_instances_for_sla_check=20

_Note: These limits only apply for `CountSlaPolicy` and `PercentageSlaPolicy`._

### Limiting Coordinator SLA

With `CoordinatorSlaPolicy` the SLA calculation is off-loaded to an external HTTP service. Some
relevant scheduler configuration options are,

-sla_coordinator_timeout=1min
-max_parallel_coordinated_maintenance=10

Handing off the SLA calculation to an external service can potentially block maintenance
on hosts for an indefinite amount of time (either due to a mis-configured coordinator or due to
a legitimately degraded service). In those situations the following metrics will be helpful to
identify the offending tasks.

sla_coordinator_user_errors_* (counter tracking number of times the coordinator for the task
returned a bad response.)
sla_coordinator_errors_* (counter tracking number of times the scheduler was not able
to communicate with the coordinator of the task.)
sla_coordinator_lock_starvation_* (counter tracking number of times the scheduler was not able to
get the lock for the coordinator of the task.)

34 changes: 32 additions & 2 deletions docs/reference/configuration.md
@@ -23,6 +23,7 @@ configuration design.
- [Announcer Objects](#announcer-objects)
- [Container Objects](#container)
- [LifecycleConfig Objects](#lifecycleconfig-objects)
- [SlaPolicy Objects](#slapolicy-objects)
- [Specifying Scheduling Constraints](#specifying-scheduling-constraints)
- [Template Namespaces](#template-namespaces)
- [mesos Namespace](#mesos-namespace)
@@ -343,7 +344,7 @@ Job Schema
```contact``` | String | Best email address to reach the owner of the job. For production jobs, this is usually a team mailing list.
```instances```| Integer | Number of instances (sometimes referred to as replicas or shards) of the task to create. (Default: 1)
```cron_schedule``` | String | Cron schedule in cron format. May only be used with non-service jobs. See [Cron Jobs](../features/cron-jobs.md) for more information. Default: None (not a cron job.)
```cron_collision_policy``` | String | Policy to use when a cron job is triggered while a previous run is still active. KILL_EXISTING Kill the previous run, and schedule the new run CANCEL_NEW Let the previous run continue, and cancel the new run. (Default: KILL_EXISTING)
```cron_collision_policy``` | String | Policy to use when a cron job is triggered while a previous run is still active. KILL\_EXISTING Kill the previous run, and schedule the new run CANCEL\_NEW Let the previous run continue, and cancel the new run. (Default: KILL_EXISTING)
```update_config``` | ```UpdateConfig``` object | Parameters for controlling the rate and policy of rolling updates.
```constraints``` | dict | Scheduling constraints for the tasks. See the section on the [constraint specification language](#specifying-scheduling-constraints)
```service``` | Boolean | If True, restart tasks regardless of success or failure. (Default: False)
@@ -359,6 +360,7 @@ Job Schema
```partition_policy``` | ```PartitionPolicy``` object | An optional partition policy that allows job owners to define how to handle partitions for running tasks (in partition-aware Aurora clusters)
```metadata``` | list of ```Metadata``` objects | list of ```Metadata``` objects for user's customized metadata information.
```executor_config``` | ```ExecutorConfig``` object | Allows choosing an alternative executor defined in `custom_executor_config` to be used instead of Thermos. Tasks will be launched with Thermos as the executor by default. See [Custom Executors](../features/custom-executors.md) for more info.
```sla_policy``` | Choice of ```CountSlaPolicy```, ```PercentageSlaPolicy``` or ```CoordinatorSlaPolicy``` object | An optional SLA policy that allows job owners to describe the SLA requirements for the job. See [SlaPolicy Objects](#slapolicy-objects) for more information.


### UpdateConfig Objects
@@ -564,7 +566,7 @@ See [Docker Command Line Reference](https://docs.docker.com/reference/commandlin
```graceful_shutdown_wait_secs``` | Integer | The amount of time (in seconds) to wait after hitting the ```graceful_shutdown_endpoint``` before proceeding with the [task termination lifecycle](https://aurora.apache.org/documentation/latest/reference/task-lifecycle/#forceful-termination-killing-restarting). (Default: 5)
```shutdown_wait_secs``` | Integer | The amount of time (in seconds) to wait after hitting the ```shutdown_endpoint``` before proceeding with the [task termination lifecycle](https://aurora.apache.org/documentation/latest/reference/task-lifecycle/#forceful-termination-killing-restarting). (Default: 5)

#### graceful_shutdown_endpoint
#### graceful\_shutdown\_endpoint

If the Job is listening on the port as specified by the HttpLifecycleConfig
(default: `health`), a HTTP POST request will be sent over localhost to this
Expand All @@ -581,6 +583,34 @@ does not shut down on its own after `shutdown_wait_secs` seconds, it will be
forcefully killed.


### SlaPolicy Objects

Configuration for specifying custom [SLA requirements](../features/sla-requirements.md) for a job. There are 3 supported SLA policies,
namely [`CountSlaPolicy`](#countslapolicy-objects), [`PercentageSlaPolicy`](#percentageslapolicy-objects) and [`CoordinatorSlaPolicy`](#coordinatorslapolicy-objects).
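An SLA policy is attached to a job via the `sla_policy` field; for example, the job below mirrors the percentage-based example from [SLA Requirements](../features/sla-requirements.md#percentage-based):

```python
Job(
  name = 'frontend-service',
  role = 'www-data',
  instances = 10000,
  sla_policy = PercentageSlaPolicy(
    percentage = 99.9,
    duration_secs = 3600
  )
  ...
)
```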


### CountSlaPolicy Objects

param | type | description
----- | :----: | -----------
```count``` | Integer | The number of active instances required every `durationSecs`.
```duration_secs``` | Integer | Minimum time duration a task needs to be `RUNNING` to be treated as active.

### PercentageSlaPolicy Objects

param | type | description
----- | :----: | -----------
```percentage``` | Float | The percentage of active instances required every `durationSecs`.
```duration_secs``` | Integer | Minimum time duration a task needs to be `RUNNING` to be treated as active.

### CoordinatorSlaPolicy Objects

param | type | description
----- | :----: | -----------
```coordinator_url``` | String | The URL of the [Coordinator](../features/sla-requirements.md#coordinator) service to be contacted before performing SLA-affecting actions (job updates, host drains, etc.).
```status_key``` | String | The field in the Coordinator response that indicates the SLA status for working on the task. (Default: `drain`)


Specifying Scheduling Constraints
=================================
