@@ -5,15 +5,24 @@ distributed coordination of Apache Flink®.
## Test Coverage
Jepsen is a framework built to test the behavior of distributed systems
- under faults. The tests in this particular project deploy Flink on YARN, Mesos, or as a standalone cluster, submit a
- job, and examine the availability of the job after injecting faults.
- A job is said to be available if all the tasks of the job are running.
+ under faults. The tests in this particular project deploy Flink on YARN, Mesos, or as a standalone cluster,
+ submit one or multiple jobs, and examine the availability of the job(s) after injecting faults.
+ Optionally, we can cancel the job(s) during the test.
The faults that can be currently introduced to the Flink cluster include:

* Killing of TaskManager/JobManager processes
* Stopping HDFS NameNode
* Network partitions

+ ### Checking Correctness
+ We define a job to be available if all the tasks of the job are running.
+ Our correctness model prescribes that:
+ * Jobs should become available within the [_job recovery grace period_](#command-line-options--configuration)
+ after the last injected fault. Note that some faults happen at a single point in time (e.g., killing of processes).
+ Other faults, such as network splits, happen during a period of time, and can thus be interleaving.
+ As long as there are active faults, jobs are allowed to be unavailable.
+ * If jobs are canceled, they must become unavailable within 10 seconds of the cancellation.
+
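The correctness model described in this hunk can be illustrated with a small, self-contained sketch. It is not the checker used by the test suite (which is not shown here); `Fault`, `unavailability_allowed`, and `check_job` are hypothetical names, and the sketch assumes that availability is sampled as `(time, available)` pairs starting once the job is first running. It also interprets "after the last injected fault" as "within the grace period after any fault has ended", which is one possible reading.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

CANCEL_GRACE_SECONDS = 10.0  # the 10 seconds stated in the correctness model


@dataclass(frozen=True)
class Fault:
    """Hypothetical record of one injected fault (times in seconds since test start)."""
    start: float
    end: float  # equal to start for point-in-time faults such as a process kill


def unavailability_allowed(t: float, faults: List[Fault], grace: float) -> bool:
    """True if a fault is active at t, or ended at most `grace` seconds before t."""
    return any(f.start <= t <= f.end + grace for f in faults)


def check_job(
    samples: List[Tuple[float, bool]],
    faults: List[Fault],
    grace: float,
    cancelled_at: Optional[float] = None,
) -> List[Tuple[float, str]]:
    """Return (time, reason) pairs for every sample that violates the model."""
    violations = []
    for t, available in samples:
        if cancelled_at is not None and t >= cancelled_at:
            # After cancellation the job must stop being available within 10 s.
            if available and t > cancelled_at + CANCEL_GRACE_SECONDS:
                violations.append((t, "still available after cancellation"))
        elif not available and not unavailability_allowed(t, faults, grace):
            # Unavailability is tolerated only during faults or the grace period.
            violations.append((t, "unavailable outside fault window"))
    return violations


# One process kill at t=10 and a network partition from t=20 to t=30.
faults = [Fault(10, 10), Fault(20, 30)]
samples = [(0, True), (12, False), (16, True), (25, False), (34, False), (36, False)]
print(check_job(samples, faults, grace=5.0))
```

With a grace period of 5 seconds, only the sample at `t=36` is flagged: it is unavailable more than 5 seconds after the partition ended at `t=30`.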
## Usage

### Setting up the Environment