Enable Tasks to specify their own custom maintenance SLA.
`Tasks` can specify custom SLA requirements as part of
their `TaskConfig`. One of the new features is the ability
to specify an external coordinator that can ACK/NACK
maintenance requests for tasks. This will be hugely
beneficial for onboarding services that cannot satisfactorily
specify SLA in terms of running instances.

Maintenance requests are driven from the Scheduler to
improve management of nodes in the cluster.

Testing Done:
./build-support/jenkins/build.sh
./src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh

Bugs closed: AURORA-1978

Reviewed at https://reviews.apache.org/r/66716/
shanmugh committed Jun 5, 2018
1 parent 34be631 commit f2acf53
Showing 49 changed files with 3,550 additions and 136 deletions.
32 changes: 32 additions & 0 deletions RELEASE-NOTES.md
@@ -1,3 +1,35 @@
0.21.0
======

### New/updated:
- Introduce ability for tasks to specify custom SLA requirements via the new `SlaPolicy` structs.
There are 3 different SLA Policies that are currently supported - `CountSlaPolicy`,
`PercentageSlaPolicy` and `CoordinatorSlaPolicy`. SLA policies based on count and percentage
express the required number of `RUNNING` instances as either a count or a percentage, along with
the duration window for which these requirements have to be satisfied. For applications
that need more control over how SLA is determined, a custom SLA calculator, known as the
Coordinator, can be configured. Any action that can affect the SLA will first check with the
Coordinator before proceeding.

**IMPORTANT: The storage changes required for this feature will make the scheduler
snapshot backwards-incompatible. The scheduler will be unable to read the snapshot if rolled back to
a previous version. If a rollback is absolutely necessary, perform the following steps:**
1. Stop all host maintenance requests via `aurora_admin host_activate`.
2. Ensure a new snapshot is created by running `aurora_admin scheduler_snapshot <cluster>`.
3. Roll back to the previous version.

Note: The `Coordinator` interface required for the `CoordinatorSlaPolicy` is experimental at
this juncture and is bound to change in the future.

### Deprecations and removals:

- Deprecated the `aurora_admin host_drain` command used for maintenance. With this release, the SLA
computations are moved to the scheduler, so the client no longer needs to compute
SLAs and watch the drains. The scheduler persists any host maintenance request and performs an
SLA-aware drain of the tasks before marking the host as `DRAINED`, so maintenance requests
survive scheduler fail-overs. Use the newly introduced `aurora_admin sla_host_drain`
to skip the SLA computations on the admin client.

0.20.0
======

6 changes: 6 additions & 0 deletions api/src/main/thrift/org/apache/aurora/gen/api.thrift
@@ -1244,6 +1244,12 @@ service AuroraAdmin extends AuroraSchedulerManager {
/** Set the given hosts back into serving mode. */
Response endMaintenance(1: Hosts hosts)

/**
* Ask the scheduler to put hosts into DRAINING mode and move scheduled tasks off of the hosts
* such that their SLA requirements are satisfied. Uses defaultSlaPolicy for tasks that do not set one.
**/
Response slaDrainHosts(1: Hosts hosts, 2: SlaPolicy defaultSlaPolicy, 3: i64 timeoutSecs)

/** Start a storage snapshot and block until it completes. */
Response snapshot()

1 change: 1 addition & 0 deletions docs/README.md
Expand Up @@ -28,6 +28,7 @@ Description of important Aurora features.
* [Services](features/services.md)
* [Service Discovery](features/service-discovery.md)
* [SLA Metrics](features/sla-metrics.md)
* [SLA Requirements](features/sla-requirements.md)
* [Webhooks](features/webhooks.md)

## Operators
181 changes: 181 additions & 0 deletions docs/features/sla-requirements.md
@@ -0,0 +1,181 @@
SLA Requirements
================

- [Overview](#overview)
- [Default SLA](#default-sla)
- [Custom SLA](#custom-sla)
- [Count-based](#count-based)
- [Percentage-based](#percentage-based)
- [Coordinator-based](#coordinator-based)

## Overview

Aurora guarantees SLA requirements for jobs. These requirements limit the impact of cluster-wide
maintenance operations on the jobs. For instance, when an operator upgrades
the OS on all the Mesos agent machines, the tasks scheduled on them need to be drained.
By specifying its SLA requirements, a job can make sure that it has enough instances to
continue operating safely without incurring downtime.

> SLA is defined as the minimum number of active tasks required for a job over every duration window.
A task is active if it was in the `RUNNING` state during the last duration window.

There is a [default](#default-sla) SLA guarantee for
[preferred](../features/multitenancy.md#configuration-tiers) tier jobs and it is also possible to
specify [custom](#custom-sla) SLA requirements.

## Default SLA

Aurora guarantees a default SLA requirement for tasks in the
[preferred](../features/multitenancy.md#configuration-tiers) tier.

> 95% of tasks in a job will be `active` over every 30-minute window.

## Custom SLA

For jobs that need stricter or different guarantees, Aurora allows jobs to specify their own
SLA requirements via an `SlaPolicy`. There are 3 different ways to express SLA requirements.

### [Count-based](../reference/configuration.md#countslapolicy-objects)

For jobs that need a minimum `number` of instances to be running all the time,
[`CountSlaPolicy`](../reference/configuration.md#countslapolicy-objects)
provides the ability to express the minimum number of required active instances (i.e. number of
tasks that are `RUNNING` for at least `duration_secs`). For instance, if we have a
`replicated-service` with 3 instances that needs at least 2 of them to be active over every
30-minute window to be considered healthy, the SLA requirement can be expressed with a
[`CountSlaPolicy`](../reference/configuration.md#countslapolicy-objects) like below:

```python
Job(
  name = 'replicated-service',
  role = 'www-data',
  instances = 3,
  sla_policy = CountSlaPolicy(
    count = 2,
    duration_secs = 1800
  )
  ...
)
```

### [Percentage-based](../reference/configuration.md#percentageslapolicy-objects)

For jobs that need a minimum `percentage` of instances to be running all the time,
[`PercentageSlaPolicy`](../reference/configuration.md#percentageslapolicy-objects) provides the
ability to express the minimum percentage of required active instances (i.e. percentage of tasks
that are `RUNNING` for at least `duration_secs`). For instance, if we have a `webservice` that
has 10000 instances for handling peak load and cannot have more than 0.1% of its instances down
over any 1-hour window, the SLA requirement can be expressed with a
[`PercentageSlaPolicy`](../reference/configuration.md#percentageslapolicy-objects) like below:

```python
Job(
  name = 'frontend-service',
  role = 'www-data',
  instances = 10000,
  sla_policy = PercentageSlaPolicy(
    percentage = 99.9,
    duration_secs = 3600
  )
  ...
)
```

### [Coordinator-based](../reference/configuration.md#coordinatorslapolicy-objects)

When neither of the above methods is enough to describe the SLA requirements for a job, the SLA
calculation can be off-loaded to a custom service called the `Coordinator`. The `Coordinator` needs
to expose an endpoint that will be called to check if the removal of a task will affect the SLA
requirements for the job. This is useful for controlling the number of tasks that undergo
maintenance at a time, without affecting the SLA of the application.

Consider an example where we have a `storage-service` that stores 2 replicas of each object. The
replicas are distributed across the instances such that they are stored on different hosts. In
addition, a consistent hash is used for distributing the data across the instances.

When an instance needs to be drained (say for host maintenance), we have to make sure that at least 1 of
the 2 replicas remains available. In such a case, a `Coordinator` service can be used to maintain
the SLA guarantees required for the job.

The job can be configured with a
[`CoordinatorSlaPolicy`](../reference/configuration.md#coordinatorslapolicy-objects) to specify the
coordinator endpoint and the field in the response JSON that indicates whether the SLA will be
affected when the task is removed.

```python
Job(
  name = 'storage-service',
  role = 'www-data',
  sla_policy = CoordinatorSlaPolicy(
    coordinator_url = 'http://coordinator.example.com',
    status_key = 'drain'
  )
  ...
)
```


#### Coordinator Interface [Experimental]

When a [`CoordinatorSlaPolicy`](../reference/configuration.md#coordinatorslapolicy-objects) is
specified for a job, any action that requires removing a task
(such as drains) must get approval from the `Coordinator` before proceeding. The
Coordinator service needs to expose an HTTP endpoint that takes a `task-key` param
(`<cluster>/<role>/<env>/<name>/<instance>`) and a JSON body describing the task
details, and returns a JSON response containing a boolean status that allows or disallows
the task's removal.

##### Request:
```javascript
POST /
?task=<cluster>/<role>/<env>/<name>/<instance>

{
  "assignedTask": {
    "taskId": "taskA",
    "slaveHost": "a",
    "task": {
      "job": {
        "role": "role",
        "environment": "devel",
        "name": "job"
      },
      ...
    },
    "assignedPorts": {
      "http": 1000
    },
    "instanceId": 1
    ...
  },
  ...
}
```
##### Response:
```json
{
  "drain": true
}
```
If the Coordinator allows removal of the task, the task's
[termination lifecycle](../reference/configuration.md#httplifecycleconfig-objects)
is triggered. If the Coordinator does not allow removal, the request will be retried later.

#### Coordinator Actions

Each Coordinator endpoint gets its own lock, which is used to serialize calls to that Coordinator.
This guarantees that only one request is in flight to a coordinator endpoint at a time, which allows
coordinators to simply look at the current state of the tasks to determine the SLA (without having
to worry about in-flight and pending requests). If there are multiple coordinators,
maintenance can still proceed in parallel across them.

_Note: A single concurrent request to a coordinator endpoint does not translate into an exactly-once
guarantee. The coordinator must be able to handle duplicate drain
requests for the same task._
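
##### Example Coordinator (sketch)

The following is a minimal sketch of a hypothetical Coordinator service (it is not part of Aurora;
the handler name, port and replica-tracking logic are illustrative assumptions). It implements the
request/response contract above and approves a drain only when the job has no other drain already
in flight, relying on the single-concurrent-request guarantee described under Coordinator Actions:

```python
# Minimal sketch of a hypothetical Coordinator (illustration only; names, port and
# the replica-tracking logic are assumptions, not part of Aurora).
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

draining_jobs = set()  # job keys that currently have a task being drained

class CoordinatorHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # The task key arrives as ?task=<cluster>/<role>/<env>/<name>/<instance>.
        task_key = parse_qs(urlparse(self.path).query).get('task', [''])[0]
        job_key = '/'.join(task_key.split('/')[:4])  # cluster/role/env/name

        # The request body carries the assigned-task details as JSON; a real
        # Coordinator could inspect it (e.g. slaveHost) to make a smarter decision.
        body_len = int(self.headers.get('Content-Length', 0))
        task_details = json.loads(self.rfile.read(body_len) or b'{}')

        # Approve the drain only if this job has no other drain in flight.
        allow = job_key not in draining_jobs
        if allow:
            draining_jobs.add(job_key)  # a real service would clear this later

        # The response key must match the job's status_key ('drain' by default).
        response = json.dumps({'drain': allow}).encode()
        self.send_response(200)
        self.send_header('Content-Type', 'application/json')
        self.send_header('Content-Length', str(len(response)))
        self.end_headers()
        self.wfile.write(response)

if __name__ == '__main__':
    HTTPServer(('0.0.0.0', 8080), CoordinatorHandler).serve_forever()
```

A real Coordinator would also clear the job entry once the drained replica is healthy again and
would consult its own replication state before answering.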
55 changes: 48 additions & 7 deletions docs/operations/configuration.md
@@ -312,19 +312,19 @@ increased).

To enable this in the Scheduler, you can set the following options:

--enable_update_affinity=true
--update_affinity_reservation_hold_time=3mins
-enable_update_affinity=true
-update_affinity_reservation_hold_time=3mins

You will need to tune the hold time to match the behavior you see in your cluster. If you have extremely
high update throughput, you might have to extend it as processing updates could easily add significant
delays between scheduling attempts. You may also have to tune scheduling parameters to achieve the
throughput you need in your cluster. Some relevant settings (with defaults) are:

--max_schedule_attempts_per_sec=40
--initial_schedule_penalty=1secs
--max_schedule_penalty=1mins
--scheduling_max_batch_size=3
--max_tasks_per_schedule_attempt=5
-max_schedule_attempts_per_sec=40
-initial_schedule_penalty=1secs
-max_schedule_penalty=1mins
-scheduling_max_batch_size=3
-max_tasks_per_schedule_attempt=5

There are metrics exposed by the Scheduler which can provide guidance on where the bottleneck is.
Example metrics to look at:
@@ -337,3 +337,44 @@ Example metrics to look at:
Most likely you'll run into limits with the number of update instances that can be processed per minute
before you run into any other limits. So if your total work done per minute starts to exceed 2k instances,
you may need to extend the update_affinity_reservation_hold_time.

## Cluster Maintenance

Aurora performs maintenance-related task drains. How often the scheduler polls for maintenance
work can be controlled via,

-host_maintenance_polling_interval=1min

## Enforcing SLA limitations

Since tasks can specify their own `SlaPolicy`, the cluster needs to limit these SLA requirements.
An overly aggressive requirement can permanently block any type of maintenance work
(e.g. OS/kernel/security upgrades) on a host and hold it hostage.

An operator can control the limits for SLA requirements via these scheduler configuration options:

-max_sla_duration_secs=2hrs
-min_required_instances_for_sla_check=20

_Note: These limits only apply for `CountSlaPolicy` and `PercentageSlaPolicy`._

### Limiting Coordinator SLA

With `CoordinatorSlaPolicy` the SLA calculation is off-loaded to an external HTTP service. Some
relevant scheduler configuration options are,

-sla_coordinator_timeout=1min
-max_parallel_coordinated_maintenance=10

Handing off the SLA calculation to an external service can potentially block maintenance
on hosts for an indefinite amount of time (either due to a mis-configured coordinator or due to
a legitimately degraded service). In those situations the following metrics will be helpful to
identify the offending tasks.

sla_coordinator_user_errors_* (counter tracking number of times the coordinator for the task
returned a bad response.)
sla_coordinator_errors_* (counter tracking number of times the scheduler was not able
to communicate with the coordinator of the task.)
sla_coordinator_lock_starvation_* (counter tracking number of times the scheduler was not able to
get the lock for the coordinator of the task.)

34 changes: 32 additions & 2 deletions docs/reference/configuration.md
@@ -23,6 +23,7 @@ configuration design.
- [Announcer Objects](#announcer-objects)
- [Container Objects](#container)
- [LifecycleConfig Objects](#lifecycleconfig-objects)
- [SlaPolicy Objects](#slapolicy-objects)
- [Specifying Scheduling Constraints](#specifying-scheduling-constraints)
- [Template Namespaces](#template-namespaces)
- [mesos Namespace](#mesos-namespace)
@@ -343,7 +344,7 @@ Job Schema
```contact``` | String | Best email address to reach the owner of the job. For production jobs, this is usually a team mailing list.
```instances```| Integer | Number of instances (sometimes referred to as replicas or shards) of the task to create. (Default: 1)
```cron_schedule``` | String | Cron schedule in cron format. May only be used with non-service jobs. See [Cron Jobs](../features/cron-jobs.md) for more information. Default: None (not a cron job.)
```cron_collision_policy``` | String | Policy to use when a cron job is triggered while a previous run is still active. KILL_EXISTING Kill the previous run, and schedule the new run CANCEL_NEW Let the previous run continue, and cancel the new run. (Default: KILL_EXISTING)
```cron_collision_policy``` | String | Policy to use when a cron job is triggered while a previous run is still active. KILL\_EXISTING Kill the previous run, and schedule the new run CANCEL\_NEW Let the previous run continue, and cancel the new run. (Default: KILL_EXISTING)
```update_config``` | ```UpdateConfig``` object | Parameters for controlling the rate and policy of rolling updates.
```constraints``` | dict | Scheduling constraints for the tasks. See the section on the [constraint specification language](#specifying-scheduling-constraints)
```service``` | Boolean | If True, restart tasks regardless of success or failure. (Default: False)
@@ -359,6 +360,7 @@ Job Schema
```partition_policy``` | ```PartitionPolicy``` object | An optional partition policy that allows job owners to define how to handle partitions for running tasks (in partition-aware Aurora clusters)
```metadata``` | list of ```Metadata``` objects | list of ```Metadata``` objects for user's customized metadata information.
```executor_config``` | ```ExecutorConfig``` object | Allows choosing an alternative executor defined in `custom_executor_config` to be used instead of Thermos. Tasks will be launched with Thermos as the executor by default. See [Custom Executors](../features/custom-executors.md) for more info.
```sla_policy``` | Choice of ```CountSlaPolicy```, ```PercentageSlaPolicy``` or ```CoordinatorSlaPolicy``` object | An optional SLA policy that allows job owners to describe the SLA requirements for the job. See [SlaPolicy Objects](#slapolicy-objects) for more information.


### UpdateConfig Objects
@@ -564,7 +566,7 @@ See [Docker Command Line Reference](https://docs.docker.com/reference/commandlin
```graceful_shutdown_wait_secs``` | Integer | The amount of time (in seconds) to wait after hitting the ```graceful_shutdown_endpoint``` before proceeding with the [task termination lifecycle](https://aurora.apache.org/documentation/latest/reference/task-lifecycle/#forceful-termination-killing-restarting). (Default: 5)
```shutdown_wait_secs``` | Integer | The amount of time (in seconds) to wait after hitting the ```shutdown_endpoint``` before proceeding with the [task termination lifecycle](https://aurora.apache.org/documentation/latest/reference/task-lifecycle/#forceful-termination-killing-restarting). (Default: 5)

#### graceful_shutdown_endpoint
#### graceful\_shutdown\_endpoint

If the Job is listening on the port as specified by the HttpLifecycleConfig
(default: `health`), a HTTP POST request will be sent over localhost to this
Expand All @@ -581,6 +583,34 @@ does not shut down on its own after `shutdown_wait_secs` seconds, it will be
forcefully killed.


### SlaPolicy Objects

Configuration for specifying custom [SLA requirements](../features/sla-requirements.md) for a job. There are 3 supported SLA policies,
namely [`CountSlaPolicy`](#countslapolicy-objects), [`PercentageSlaPolicy`](#percentageslapolicy-objects) and [`CoordinatorSlaPolicy`](#coordinatorslapolicy-objects).
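An SLA policy is attached to a job via the `sla_policy` field; for example, the job below mirrors the percentage-based example from [SLA Requirements](../features/sla-requirements.md#percentage-based):

```python
Job(
  name = 'frontend-service',
  role = 'www-data',
  instances = 10000,
  sla_policy = PercentageSlaPolicy(
    percentage = 99.9,
    duration_secs = 3600
  )
  ...
)
```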


### CountSlaPolicy Objects

param | type | description
----- | :----: | -----------
```count``` | Integer | The number of active instances required every `durationSecs`.
```duration_secs``` | Integer | Minimum time duration a task needs to be `RUNNING` to be treated as active.

### PercentageSlaPolicy Objects

param | type | description
----- | :----: | -----------
```percentage``` | Float | The percentage of active instances required every `durationSecs`.
```duration_secs``` | Integer | Minimum time duration a task needs to be `RUNNING` to be treated as active.

### CoordinatorSlaPolicy Objects

param | type | description
----- | :----: | -----------
```coordinator_url``` | String | The URL of the [Coordinator](../features/sla-requirements.md#coordinator) service to be contacted before performing SLA-affecting actions (job updates, host drains, etc.).
```status_key``` | String | The field in the Coordinator response that indicates the SLA status for working on the task. (Default: `drain`)


Specifying Scheduling Constraints
=================================
