Commit 7e3b2cd

[hotfix][tests] Document flink-jepsen correctness model
1 parent 8461066 commit 7e3b2cd

1 file changed: +12 -3 lines changed

flink-jepsen/README.md

@@ -5,15 +5,24 @@ distributed coordination of Apache Flink®.
 
 ## Test Coverage
 Jepsen is a framework built to test the behavior of distributed systems
-under faults. The tests in this particular project deploy Flink on YARN, Mesos, or as a standalone cluster, submit a
-job, and examine the availability of the job after injecting faults.
-A job is said to be available if all the tasks of the job are running.
+under faults. The tests in this particular project deploy Flink on YARN, Mesos, or as a standalone cluster,
+submit one or multiple jobs, and examine the availability of the job(s) after injecting faults.
+Optionally, we can cancel the job(s) during the test.
 The faults that can be currently introduced to the Flink cluster include:
 
 * Killing of TaskManager/JobManager processes
 * Stopping HDFS NameNode
 * Network partitions
 
+### Checking Correctness
+We define a job to be available if all the tasks of the job are running.
+Our correctness model prescribes that:
+* Jobs should become available within the [_job recovery grace period_](#command-line-options--configuration)
+after the last injected fault. Note that some faults happen at a single point in time (e.g., killing of processes).
+Other faults, such as network splits, happen during a period of time, and can thus be interleaving.
+As long as there are active faults, jobs are allowed to be unavailable.
+* If jobs are canceled, they must become unavailable within 10 seconds of the cancellation.
+
 ## Usage
 
 ### Setting up the Environment
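
The two rules in the "Checking Correctness" section added by this commit can be illustrated with a small sketch. This is not the flink-jepsen checker itself; `Fault`, `Sample`, `recovered_in_time`, and `cancellation_respected` are hypothetical names, and the sketch assumes that availability is observed as timestamped samples and that, when no fault was injected, the grace period is counted from time 0.

```python
from dataclasses import dataclass
from typing import List, Tuple

# (timestamp in seconds, True if all tasks of the job were running)
Sample = Tuple[float, bool]


@dataclass
class Fault:
    """Hypothetical fault record; point-in-time faults have start == end."""
    start: float
    end: float


def recovery_deadline(faults: List[Fault], grace_period_s: float) -> float:
    # Faults may overlap, so only the latest end time matters.
    # Assumption: with no faults, the grace period starts at time 0.
    last_end = max((f.end for f in faults), default=0.0)
    return last_end + grace_period_s


def recovered_in_time(samples: List[Sample], faults: List[Fault],
                      grace_period_s: float) -> bool:
    """Rule 1: the job is available in every sample at or after the deadline.

    Samples before the deadline are ignored, because jobs are allowed to be
    unavailable while faults are active or the grace period is running.
    """
    deadline = recovery_deadline(faults, grace_period_s)
    return all(ok for t, ok in samples if t >= deadline)


def cancellation_respected(samples: List[Sample], cancel_time: float,
                           limit_s: float = 10.0) -> bool:
    """Rule 2: the job is unavailable in every sample at or after
    cancel_time + limit_s (10 seconds in the README text)."""
    return all(not ok for t, ok in samples if t >= cancel_time + limit_s)
```

For example, with a process kill at t=5 and a network partition from t=10 to t=20, a grace period of 10 seconds puts the deadline at t=30, so an unavailable sample at t=21 is tolerated while one at t=35 would fail the check.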
