Skip to content

Latest commit

 

History

History
288 lines (241 loc) · 13.4 KB

README.md

File metadata and controls

288 lines (241 loc) · 13.4 KB

EWD998

Markus A Kuppe (2022-05-03)

The specification in this directory models termination detection in a ring as given by Shmuel Safra in EWD 998. EWD 998 is an extension of a simpler algorithm described in EWD 840. Compared to EWD 840, this algorithm supports asynchronous message delivery. However, it still assumes reliable message channels and non-faulty nodes. For TLA+ learners, it is best to study EWD840 first and then read the TLA+ spec of EWD998.tla.

For readers familiar with TLA+:

EWD998.tla refines the abstraction AsyncTerminationDetection under the refinement mapping given in EWD998.tla. This mapping has been model-checked with TLC. Additionally, the spec AsyncTerminationDetection_proof proves the main safety properties for module AsyncTerminationDetection, and refinement of SyncTerminationDetection.tla. Elsewhere, SyncTerminationDetection is refined by EWD840.tla.

SyncTerminationDetection < EWD840
                         < AsyncTerminationDetection 
                              < EWD998
                                  < EWD998Chan
                                      < EWD998ChanID

The spec EWD998Chan.tla refines EWD998.tla by adding inboxes. EWD998ChanID.tla refines EWD998Chan.tla and replace natural numbers as node identifiers with hostnames (arbitrary strings actually). EWD998ChanID.tla also shows how to model vector clocks in TLA+, which EWD998ChanID_shiviz.tla exploits to visualize traces with Shiviz. The spec EWD998ChanID_export.tla demonstrates how to export TLC traces to Json. Below is an animation of the termination detection process for six nodes. It was created with EWD998_anim.tla.

Termination detection with EWD998 with six processes

Statistics

Because of its simplicity, EWD998 has been chosen to explore the idea of inferring statistical properties by simulating TLA+ specs (with TLC’s simulator). In a nutshell, we randomly generates traces for real-world configurations (node numbers) or workloads (global state of the system when terminated), and predict the runtime behavior of the system. Such properties cannot be expressed with TLA+ because they are true or false of a set of behaviors; not a single behavior. If random simulation is good enough in practice to predict statistical properties, it could help short-circuit modeling of an algorithm that currently requires empirical analysis of an algorithm’s implementation. In other words, traditionally we would model EWD998 in TLA+ and check safey and liveness properties up to proving correctness. Then, we would implement EWD998 in a suitable programming language and analyze its statistical properties by running the implementation. If the statistical properties are unsatisfactory, we rinse and repeat and go back and optimize the algorithm.

In the case of EWD998, we pretend that we wish to study the efficiency of four different variants of the original termination detection algorithm:

  1. “pt1”: An active node may pass the token if the node is black/tainted.
  2. “pt2”: An active node may pass the token if the token is black/tainted.
  3. “pt3”: Return the token to the initiator, thus, abort the token round, iff the node is black.
  4. “pt4”: Return the token to the initiator, thus, abort the token round, iff the token is black.

pt3 and pt4 can be seen as “aborting” an inconclusive token round by directly returning the token to the initiator. The two variants come at the cost of all nodes knowing the identify of the initiator. However, this could be addressed by stamping the initiator’s id onto the token under the assumption that the underlying network allows for all nodes to send a token to the initiator (i.e. not a ring).

The way we are going to measure efficiency is by measuring the number of steps between termination of Environment and the detection of termination by System (see EWD998.tla for their definitions).

Simulation

In this section, we outline the simluation of the TLA+ specs and discuss our results.

The spec EWD998_opts.tla extends module EWD998 (EWD998Chan actually) and, depending on the given “feature flags” F, enables the variants pt1 to pt4 in the sub-action PassTokenOpts:

PassTokenOpts(n) ==
  /\ n # 0
  /\ \/ ~ active[n]
     \/ /\ "pt1" \in F
        /\ color[n] = "black"
     \/ /\ "pt2" \in F
        /\ \E j \in 1..Len(inbox[n]) : inbox[n][j].type = "tok" /\ inbox[n][j].color = "black"
  /\ \E j \in 1..Len(inbox[n]) : 
          /\ inbox[n][j].type = "tok"
          /\ LET tkn == inbox[n][j]
             IN  inbox' = [inbox EXCEPT ![CASE "pt3" \in F /\ color[n] = "black" -> 0
                                            [] "pt4" \in F /\ tkn.color ="black" -> 0
                                            [] OTHER    ->  n-1] = 
                                       Append(@, [tkn EXCEPT !.q = tkn.q + counter[n],
                                                             !.color = IF color[n] = "black"
                                                                       THEN "black"
                                                                       ELSE tkn.color]),
                                    ![n] = RemoveAt(@, j) ]
  /\ color' = [ color EXCEPT ![n] = "white" ]
  /\ UNCHANGED <<active, counter>>

The TLC action constraint AtTermination sets the first element of the list equal to TLCGet to the ordinal of the state in which all nodes are inactive and no messages are in-flight.

AtTermination ==
    IF EWD998!Termination # EWD998!Termination'
    THEN TLCSet(1, TLCGet("level"))
    ELSE TRUE

The TLC state constraint AtTerminationDetected resets the first element of the list TLCGet, after writing a record to the CSV file defined by the environment variable IOEnv.Out. AtTerminationDetected also writes the occurrences of all sub-actions in the current behavior to the CSV file (note new TLCGet("stats").behavior.actions), as well as the values of the constants F and N.

AtTerminationDetected ==
    EWD998!terminationDetected =>
    /\ LET o == TLCGet("stats").behavior.actions
       IN /\ CSVWrite("%1$s#%2$s#%3$s#%4$s#%5$s#%6$s#%7$s#%8$s#%9$s",
               << F, N, TLCGet("level"), TLCGet("level") - TLCGet(1),
                 o["InitiateProbe"],o["PassTokenOpts"], \* Note "Opts" suffix!
                 o["SendMsg"],o["RecvMsg"],o["Deactivate"]
                 >>,
               IOEnv.Out)
          /\ TLCSet(1, 0)

There is more boilerplate and re-definitions (SomeRingOfNodes, InitSim, …) to help TLC in EWD99_opts.tla, but none are relevant with regards to statistics. If we wouldn’t be studying pt1 and pt2, we could rewrite InitSim and SpecOpts to start in states where all nodes are inactive and no Environment steps are possible.

We then generate a reasonable number of traces for all elements of (SUBSET {pt1,pt2,pt3,pt4}) \X {7,29,43}. This is done in the TLA+ “script” EWD998_optsSC.tla. The average number of token passes are plotted in the barplots below.

Note that the underlying R code can be found in README.Rmd, from which this file is generated on a host with all libraries installed via the command Rscript -e "rmarkdown::render('README.Rmd')".

Average length of suffix for which terminated /\ ~terminationDetected holds

All three barplots above confirm that the average length of the suffix, for which terminated /\ ~terminationDetected is true, is consistently higher for feature flag subsets that include the pt4 variant and highest for the pt3 variant. Additionally, we see that the signal gets stronger for larger numbers of nodes. This could indicate that the small-scope hypothesis does not transfer to empirical measurements.

Plotting the average occurrences of the spec’s sub-actions shows how the InitiateProbe and, consequently (for each InitiateProbe action there are N PassToken actions), PassToken actions increase for variants pt3 and pt4. The higher number of SendMsg/RecvMsg actions for the original spec is likely statistical noise because it’s limited to N=7.

Average number of sub-actions per behavior

To conclude this section, differential analysis along multiple dimensions (number of nodes, spec variants) appears to be useful for predicting statistical properties of algorithmic variants. The occurrences of sub-actions alone, hints at properties that can be studied further by instrumenting a spec as shown above. However, even wall-clock time to simulate a fixed number of behaviors over the two dimensions number of nodes and spec variants alone, indicated the inferior behavior of pt3–running the simulator took forever to the point that it seemed stuck for pt3 with >100 nodes.

Running statistical analysis at the algorithm/spec level should generally provide a more robust signal because many sources of noise that usually distort measurements at the implementation level do not exist. Related, we don’t fully take advantage of studying properties above the code/implementation level, because we still let the simulator generate behaviors satisfying the original behavior spec that allows Environment steps. It would be faster to generate behavior by defining the set of initial states to be all states in which terminated /\ ~terminationDetected holds. Unfortunately, this is infeasible because the TLC simulator generates all initial states upfront. For increasing numbers of nodes, the cardinality of this set becomes astronomical (under a suitable state constraint to bound the message counters). The Randomization.tla module, otherwise used to validate inductive invariants, is a possible workaround.

Analytical worst-case complexity

With the help of the animation of the original algorithm below, we can deduce that the number of token passes is at most 3N with N the number of nodes (we start termination detection in a state where all nodes have terminated and are black, and the token is at node N-2). The original algorithm requires two inconclusive token rounds to reset all nodes to white and a final, conclusive round to detect termination. In short, the number of required token passes after termination is linear in the size of the ring.

Worst-case termination detection with original EWD998 with eight processes

On the other hand, the animation below reveals that the variant pt3 causes the inconclusive token rounds to be aborted N times, which gives us ((n+1)^2 + (n+1) / 2) - 1 as the overall worst-case number of token passes.

Worst-case termination detection with pt3 variant with eight processes

Variant pt4 has a slightly better worst-case behavior, but it still requires ((N+1)/2)^2+N when N is odd, and ((N+2)/2)^2+N-1 token passes when N is even.

Worst-case termination detection with pt4 variant with eight processes

(pt1 and pt2 have no relevance on the number of token passes between termination and termination detection, because they apply to active nodes only. More importantly, with pt1 and pt2 active, the enablement of sub-action PassTokenOpts changes increasing the probability of the simulator choosing PassTokenOpts to extend the current trace.)

Randomized State Exploration (“Smoke Testing”)

Violations found checking SmokeInit /\ [Next]_vars /\ WF_vars(System) => []Inv /\ ATDSpec with randomized state exploration (“simulation”) starting from a small, randomly chosen subset of the initial states.

README.md is generated from README.Rmd on a host with all libraries installed via:

Rscript -e "rmarkdown::render('README.Rmd')"

Install required libraries and R packages (on macOS) with:

brew install pandoc r
Rscript -e "install.packages(c('rmarkdown', 'ggplot2','dplyr', 'here'), repos='http://cran.us.r-project.org')"