| 1 | +Polymorphic Summaries for Abstract Definitional Interpreters |
| 2 | +========================================= |
| 3 | + |
| 4 | +[AAM](https://dl.acm.org/doi/10.1145/1863543.1863553)- and |
| 5 | +[ADI](https://dl.acm.org/doi/10.1145/3110256)-based |
| 6 | +analyses' faithfulness to operational semantics brings substantial benefits |
| 7 | +(e.g. easy extension to expressive features with straightforward soundness proofs), |
| 8 | +but also seems to inevitably carry over one downside: |
| 9 | +each component is re-analyzed at each use, which either |
| 10 | +enjoys precision when it works well, or duplicates imprecision and multiplies overhead |
| 11 | +when it doesn't. |
| 12 | +The lack of function summaries is a key hindrance in scaling these analyses to large codebases. |
| 13 | + |
| 14 | +This **in progress** Redex model describes a re-formulation of ADI to produce |
| 15 | +*polymorphic summaries* |
| 16 | +that enable a scalable and precise analysis of incomplete higher-order stateful programs, |
| 17 | +*using only a first-order language for summaries*. |
| 18 | +Summaries for smaller units (e.g. functions, modules, etc.) can be used as-is when analyzing |
| 19 | +arbitrary code without compromising soundness, |
| 20 | +but can also be "linked" together for a more precise summary, |
| 21 | +at a cost less than a from-scratch analysis. |
| 22 | +What's more, ADI naturally yields *memoized top-down*, as opposed to bottom-up, summarization, |
| 23 | +avoiding the cost of analyzing the entire codebase when the entry point only reaches a |
| 24 | +small fraction of the code. |
| 25 | + |
| 26 | +High-level Ideas |
| 27 | +----------------------------------------- |
| 28 | + |
| 29 | +Initial idea: We want to produce summaries by running each `λ` only once[^once], |
| 30 | +*symbolically*, agnostic to any caller. |
| 31 | +The *polymorphic summary* contains result, effect (e.g. "store delta"), |
| 32 | +and path-condition in terms of the symbolic arguments. |
| 33 | +Then at each call site, we instantiate the summary by substituting the actual arguments, |
| 34 | +potentially eliminating spurious paths. |
| 35 | +[^once]: Unless there's a circular dependency, which is taken care of by ADI's cache-fixing loop. |
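
As a rough illustration of the initial idea, the sketch below instantiates a summary at a call site by substitution and drops cases whose path-condition cannot hold there. The `Case` record, the `SUMMARY` contents, and the `(name, λ-tag)` guard encoding are all assumptions made for this example and are not taken from the Redex model.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Case:
    guard: tuple   # (name, λ-tag) that this path requires; hypothetical encoding
    result: str    # a name, possibly symbolic
    effect: tuple  # store delta as ((address, value), ...) pairs

# Hypothetical summary of a function with parameter `f` and free variable `y`.
SUMMARY = [
    Case(guard=("f", "λ1"), result="y", effect=(("y", "f"),)),
    Case(guard=("f", "λ2"), result="f", effect=()),
]

def instantiate(summary, actuals, tags):
    """Substitute actuals for symbolic names and drop infeasible cases.

    actuals: symbolic name -> caller-side name
    tags:    caller-side name -> set of λ-tags it may carry
    """
    def subst(name):
        return actuals.get(name, name)

    out = []
    for case in summary:
        name, tag = case.guard
        if tag not in tags[subst(name)]:
            continue  # path-condition is unsatisfiable at this call site
        out.append(Case(guard=(subst(name), tag),
                        result=subst(case.result),
                        effect=tuple((subst(a), subst(v)) for a, v in case.effect)))
    return out
```

Calling `instantiate(SUMMARY, {"f": "g", "y": "b"}, {"g": {"λ2"}})` keeps only the second case, since the caller knows `g` cannot carry tag `λ1`.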
| 36 | + |
| 37 | +The obvious problem with this idea is: *What should we do with symbolic functions?* |
| 38 | + |
| 39 | +While it's tempting to try to come up with a language of "polymorphic, higher-order summaries" |
| 40 | +similar to a type-and-effect system that somehow is decidable without resorting to trivial |
| 41 | +imprecision, that's a very difficult route, especially for untyped languages and idioms. |
| 42 | +Even if we managed to do it for a fixed set of |
| 43 | +language features, the language of summaries would likely be far removed from the |
| 44 | +operational semantics, making soundness proofs and extensions to new features challenging. |
| 45 | + |
| 46 | +Another choice that would result in pessimal precision, despite soundness, is to use |
| 47 | +[havoc](https://dl.acm.org/doi/10.1145/3158139) |
| 48 | +to over-approximate all interactions with unknown code whenever in doubt. |
| 49 | +For contract verification, the imprecision can be |
| 50 | +mitigated with more precise contracts at boundaries (e.g. parametric contracts), |
| 51 | +but that isn't viable for static analysis in general. |
| 52 | + |
| 53 | +To combine *almost* the simplicity of mirroring the operational semantics, |
| 54 | +the scalability of function summaries, |
| 55 | +and the sound modularity from `havoc`, we make the following two observations: |
| 56 | + |
| 57 | +1. Whole higher-order programs are essentially first-order programs |
| 58 | + (e.g. by defunctionalization). |
| 59 | + With the closed-world assumption, *higher-order functions only need first-order summaries*. |
| 60 | + |
| 61 | +2. Incomplete programs are essentially whole programs, with unknown code filled in as `havoc`, |
| 62 | + with its own first-order summary. |
| 63 | + |
| 64 | +### "Polymorphic", just not "parametric" |
| 65 | + |
| 66 | +This work is inspired by the line of work on |
| 67 | +[polymorphic method summaries for whole OO programs](https://web.eecs.umich.edu/~xwangsd/pubs/aplas15.pdf), |
| 68 | +where it's valid to make the closed-world assumption that all implementations of each class |
| 69 | +are known. At each virtual method invocation, the analysis simply gathers the summaries |
| 70 | +of *all* known implementations, |
| 71 | +then uses information at the call-site to discard spurious cases |
| 72 | +or propagate constraints on the receiver's dynamic class tag. |
| 73 | +The obtained "polymorphic" summaries are only valid in the specific whole program they're in, |
| 74 | +and would be invalidated by additional method implementations. |
| 75 | + |
| 76 | +For functional programs, this is analogous to analyzing their defunctionalized versions[^defunct]. |
| 77 | +Specifically, each `(λ x e)` form has a first-order summary of `e`'s result and effect |
| 78 | +(as a *store delta*), with respect to a *symbolic environment* ranging over `e`'s free variables. |
| 79 | +Each application's results and effects are gathered from the summaries of all known `λ`s, |
| 80 | +each guarded by a constraint over the target closure's "`λ`-tag". |
| 81 | +As long as the program is closed, the summaries are sound. |
| 82 | +Summary instantiation is a straightforward substitution of the symbolic environment with |
| 83 | +the actual one from the target closure, with no further call to evaluation. |
| 84 | +Summary instantiation is sensitive to the closure's `λ`-tag, its environment, and the argument. |
| 85 | + |
| 86 | +[^defunct]: Although we don't need to explicitly defunctionalize the program. |
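
The "defunctionalized view" of application can be sketched as a case split over the known `λ`s. In the sketch below, the `KNOWN` table, the string form of the guard, and the function name `apply_closure` are hypothetical; a real summary would carry structured values and store deltas rather than strings.

```python
# Hypothetical first-order summaries, one per known λ-tag.
KNOWN = {
    "λ1": {"result": "arg", "effect": {}},
    "λ2": {"result": "env[c]", "effect": {"env[c]": "arg"}},
}

def apply_closure(callee, callee_tags=None, known=KNOWN):
    """Gather the summaries of all known λs the callee may be.

    callee_tags: set of λ-tags the callee is known to carry, or None when
                 nothing is known and every λ must be considered.
    Returns a list of (guard, result, effect) cases, each guarded by a
    constraint on the callee's λ-tag.
    """
    cases = []
    for tag, summary in known.items():
        if callee_tags is not None and tag not in callee_tags:
            continue  # call-site information rules this λ out
        guard = f"tag({callee}) = {tag}"
        cases.append((guard, summary["result"], summary["effect"]))
    return cases
```

Note that no symbolic function is ever applied: the callee is only inspected through its tag, and each case reuses a memoized summary.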
| 87 | + |
| 88 | +Note that there is no such thing as "applying a symbolic function then obtaining its symbolic result |
| 89 | +and effect" in this system, thanks to the "defunctionalized view". |
| 90 | +When "in doubt", we case-split over all known `λ`s then use respective first-order summaries |
| 91 | +guarded by constraints over the `λ`-tag. |
| 92 | + |
| 93 | +Although it sounds expensive to consider all `λ`s in many cases, remember that we are |
| 94 | +pumping memoized summaries around, not re-running analyses. |
| 95 | + |
| 96 | +### Removing the closed-world assumption to enable modularity |
| 97 | + |
| 98 | +Whole-program analyses are useful, and separating the *scalability* goal from the stronger |
| 99 | +*modularity* goal is neat, but we sometimes need modularity, such as verifying a component |
| 100 | +once and for all against arbitrary uses. |
| 101 | + |
| 102 | +Luckily, there's |
| 103 | +[previous work on soundly over-approximating unknown code in the presence of |
| 104 | +higher-order functions and mutable state](https://dl.acm.org/doi/10.1145/3158139) |
| 105 | +using an operation called `havoc`. |
| 106 | +To generalize function summaries for incomplete programs, we simply add `havoc`'s summaries |
| 107 | +for use when (1) applying a closure whose `λ`-tag is from the unknown, |
| 108 | +and (2) when a `λ` from one module flows to another module that doesn't already "know" it. |
| 109 | +As a result, we get *summaries that are sound even in an open world, |
| 110 | +and also precise for known code*. |
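
One way to picture case (1) is a summary lookup that falls back to `havoc`'s summary for any `λ`-tag the current module does not know. This is only a sketch: the placeholder contents of `HAVOC_SUMMARY` and the name `summaries_at_call` are assumptions, and the Redex model itself omits `havoc`.

```python
# Placeholder for havoc's own first-order summary (contents are hypothetical).
HAVOC_SUMMARY = {"result": "unknown", "effect": "havoc-escaped"}

def summaries_at_call(callee_tags, known):
    """Pick a summary for each λ-tag the callee may carry.

    Tags missing from `known` come from unknown code, so they are
    over-approximated with havoc's summary instead of being dropped.
    """
    return {tag: known.get(tag, HAVOC_SUMMARY) for tag in callee_tags}
```

Known tags keep their precise summaries, so imprecision is confined to the cases that actually involve unknown code.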
| 111 | + |
| 112 | +When analyzing code that uses both modules `m1` and `m2`, we could either re-use their |
| 113 | +separately produced summaries as-is, or "link"[^link] those summaries by |
| 114 | +re-running ADI's cache-fixing loop over the merged summaries to produce a more precise |
| 115 | +summary of both `m1` and `m2`. |
| 116 | + |
| 117 | +[^link]: The term "linking" only conveys a loose analogy, unfortunately: |
| 118 | +we still need `m1` and `m2`'s source code to re-run ADI. But the process is cheaper |
| 119 | +than re-running ADI from scratch. |
| 120 | + |
| 121 | +### ADI for memoized top-down instead of bottom-up |
| 122 | + |
| 123 | +Most work on function summaries presents the process as eager bottom-up. As is well understood from |
| 124 | +dynamic programming, one drawback compared to (memoized) top-down is the potentially needless |
| 125 | +computations when only a fraction of the codebase is reachable from the entry point. |
| 126 | + |
| 127 | +ADI naturally carries out the semantics top-down memoized, starting from the entry |
| 128 | +point. We track the `λ`s whose closures have been created, and only consider summaries for those, |
| 129 | +on-demand, at applications. This potentially requires more iterations to reach a fix-point, |
| 130 | +but can save lots of work when the reachable code is small compared to the entire codebase. |
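
The saving from working top-down can be seen in a small sketch that summarizes functions on demand from an entry point. It assumes a static call graph given as a dictionary, which is a simplification: the text above tracks created closures rather than a fixed graph, and a real ADI would iterate its cache to a fixpoint instead of using a single provisional entry.

```python
def summarize_top_down(entry, calls):
    """Summarize only what is reachable from `entry`, memoizing each function.

    calls: function name -> list of callee names (hypothetical call graph).
    The "summary" here is just the set of functions reachable from each
    function, standing in for a real result/effect summary.
    """
    memo = {}

    def visit(f):
        if f in memo:
            return memo[f]
        memo[f] = frozenset()  # provisional entry so that cycles terminate
        reach = {f}
        for g in calls.get(f, ()):
            reach |= visit(g) | {g}
        memo[f] = frozenset(reach)
        return memo[f]

    visit(entry)
    return memo
```

With `calls = {"main": ["f"], "f": ["g"], "g": [], "unused": ["g"]}`, only `main`, `f`, and `g` are ever summarized; an eager bottom-up pass would also have processed `unused`.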
| 131 | + |
| 132 | +The Redex Model (NOT FINISHED!!) |
| 133 | +----------------------------------------- |
| 134 | + |
| 135 | +The Redex model is for a *modified concrete semantics that summarizes each function as |
| 136 | +first-order values and store-deltas*. |
| 137 | +This semantics should compute the same result as the standard operational semantics[^proof], |
| 138 | +and can be systematically finitized (à la AAM/ADI) to obtain a static analysis. |
| 139 | +The model is minimal and omits `havoc`. |
| 140 | +The language is λ-calculus with `set!`. |
| 141 | + |
| 142 | +[^proof]: TODO needs proof. But this can be done once, then all abstractions by finitization and non-determinism should be sound. |
| 143 | + |
| 144 | +### *Symbolic* vs *transparent* |
| 145 | + |
| 146 | +There's a distinction between *symbolic* and *transparent* addresses `α` and values `v`. |
| 147 | +When analyzing a function: |
| 148 | +* A *symbolic address* stands for one that is passed in from an arbitrary caller. |
| 149 | + All values and addresses initially reachable from it are also symbolic. |
| 150 | + No aliasing information is assumed. |
| 151 | +* A *transparent address* is one allocated either directly within the function's body |
| 152 | + or indirectly from its callees. Aliasing among transparent addresses is always |
| 153 | + soundly tracked and approximated. In particular, transparent addresses cannot be aliased |
| 154 | + by symbolic addresses, by construction. |
| 155 | + |
| 156 | +Unlike standard ADI, this system: |
| 157 | +* only memoizes function bodies (as opposed to all sub-expressions) |
| 158 | +* memoizes symbolic results to be instantiated at each call site |
| 159 | + (as opposed to already context-sensitive results to be returned as-is). |
| 160 | + |
| 161 | +In this formalism, application is the only place where we allocate a transparent address. |
| 162 | +Given the callee's result, we specialize it by: |
| 163 | +* substituting its symbolic addresses with the caller's ones (with `⟹`) |
| 164 | +* consistently renaming its transparent addresses by attaching the caller's context `ℓ` |
| 165 | + (with `tick`; the name is inspired by AAM, but its use contrasts with classic AAM |
| 166 | + and ADI, which propagate contexts from callers to callees.) |
| 167 | +* doing the obvious constraint propagation, path pruning, and effect joining (with `⊔σ`). |
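
The three specialization steps above can be sketched on a toy store delta. The address encoding `(kind, name, contexts)`, the choice to append the label at the end of the context list, and treating values as plain addresses are assumptions of this example; constraint propagation and path pruning are left out.

```python
def tick(addr, label):
    """Attach the caller's context `label` to a transparent address."""
    kind, name, ctx = addr
    return (kind, name, ctx + (label,)) if kind == "tr" else addr

def specialize(addr, subst, label):
    """Rename one callee-side address for the caller at call site `label`."""
    kind, name, _ = addr
    if kind == "sym":
        return subst[name]    # symbolic: replace with the caller's address
    return tick(addr, label)  # transparent: rename consistently

def specialize_delta(delta, subst, label):
    """Specialize a store delta, joining entries that land on the same address."""
    out = {}
    for addr, values in delta.items():
        target = specialize(addr, subst, label)
        out.setdefault(target, set()).update(
            specialize(v, subst, label) for v in values)
    return out
```

Here the callee never sees the caller's context; the context is attached only when the callee's result is brought back into the caller.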
| 168 | + |
| 169 | +### Path-condition representation |
| 170 | + |
| 171 | +Instead of representing conditions per-path, we guard values (and therefore chunks |
| 172 | +of effect) with smaller chunks of the path-condition, and take their conjunction at appropriate |
| 173 | +places. This representation prevents eager splitting and duplicating of many constructs. |
| 174 | + |
| 175 | +For example, `{[α₁ ↦ {v₁ if φ₁, v₂ if φ₂}] [α₂ ↦ {v₃ if φ₃, v₄ if φ₄}]}` compactly represents |
| 176 | +many paths: `{[α₁ ↦ v₁, α₂ ↦ v₃] if φ₁ ∧ ¬φ₂ ∧ φ₃}`, `{[α₁ ↦ v₂, α₂ ↦ v₃] if ¬φ₁ ∧ φ₂ ∧ φ₃}`, |
| 177 | +`{[α₁ ↦ {v₁, v₂}, α₂ ↦ v₃] if φ₁ ∧ φ₂ ∧ φ₃}`, etc. |
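
To see how much the guarded representation saves, the sketch below expands a guarded store into explicit per-path stores. It assumes that guards are independent of each other and that at least one guard holds for every address; both are simplifications made for this example, and the sketch spells out every guard of every address in each path-condition.

```python
from itertools import product

def expand(store):
    """Enumerate explicit (store, path-condition) pairs from a guarded store.

    store: address -> list of (value, guard) pairs, with guards as strings.
    """
    addrs = list(store)
    per_addr = []
    for a in addrs:
        guarded = store[a]
        options = []
        for bits in product([False, True], repeat=len(guarded)):
            if not any(bits):
                continue  # assume every address holds at least one value
            values = frozenset(v for (v, _), b in zip(guarded, bits) if b)
            cond = tuple(g if b else "¬" + g for (_, g), b in zip(guarded, bits))
            options.append((values, cond))
        per_addr.append(options)
    for combo in product(*per_addr):
        path_store = {a: values for a, (values, _) in zip(addrs, combo)}
        path_cond = tuple(c for _, cond in combo for c in cond)
        yield path_store, path_cond
```

For a store with two addresses and two guarded values each, the compact form has four guarded values, while `expand` yields nine explicit paths.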
| 178 | + |
| 179 | +The language of constraints in `φ` talks about (1) `λ`-tags of values and |
| 180 | +(2) referential equality. |
| 181 | + |
| 182 | +### Aliasing |
| 183 | + |
| 184 | +Two symbolic closures (with the same `λ`-tag) can be aliases of each other, |
| 185 | +which means if we execute one and get a `set!` effect at symbolic address `ρ₁[x]`, |
| 186 | +it may or may not be reflected at symbolic address `ρ₂[x]` when we run the other closure for the result. |
| 187 | +For this reason, whenever we get back an effect `ρ₁[x] ↦ {v₁}`, we also include the conditional |
| 188 | +effect `ρ₂[x] ↦ {v₁ if ρ₁ = ρ₂}`. |
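
A minimal sketch of this rule follows: every effect on one symbolic environment is copied to each other environment with the same `λ`-tag, guarded by an equality between the two. The `(environment, variable)` address encoding and the string guards, with `"true"` for an unconditional effect, are assumptions of this example.

```python
def with_alias_effects(effect, same_tag_envs):
    """Add conditional copies of each effect for possibly aliased environments.

    effect:        (env, var) -> set of (value, guard) pairs
    same_tag_envs: names of symbolic environments whose closures share a λ-tag
    """
    out = {addr: set(values) for addr, values in effect.items()}
    for (env, var), values in effect.items():
        for other in same_tag_envs:
            if other == env:
                continue
            alias = f"{env} = {other}"
            for value, guard in values:
                cond = alias if guard == "true" else f"{guard} ∧ {alias}"
                out.setdefault((other, var), set()).add((value, cond))
    return out
```

Starting from `{("ρ1", "x"): {("v1", "true")}}` with environments `ρ1` and `ρ2`, the result keeps the original effect and adds `("v1", "ρ1 = ρ2")` at `("ρ2", "x")`.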
| 189 | + |
| 190 | +### Unbounded components in the concrete summarizing semantics, and their abstractions |
| 191 | + |
| 192 | +Two components can grow unboundedly: |
| 193 | +1. Transparent addresses with accumulated contexts: callers can keep attaching new contexts |
| 194 | + to callee's returned transparent addresses to distinguish results from different call sites. |
| 195 | +2. Symbolic addresses with growing access paths (e.g. `ρ[x]->y->x->y->...`) |
| 196 | + |
| 197 | +We apply AAM/ADI-style finitization: |
| 198 | +1. For transparent addresses, we could either keep a fixed number `k` of contexts, |
| 199 | + or approximate the list with a set (which is finite in any program). |
| 200 | + Either way, the choice cannot multiply analysis contexts; |
| 201 | + we only risk poorly summarized results. |
| 202 | +2. For symbolic addresses, |
| 203 | + we can store-allocate the access paths to effectively summarize unbounded chains |
| 204 | + using a regular language that reflects the program's structure. |
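
The two options for bounding contexts on transparent addresses might look as follows. Both function names are hypothetical, and keeping the `k` most recent labels is one possible reading of a fixed `k`.

```python
def tick_k(ctx, label, k):
    """Append `label`, keeping only the k most recent call-site labels."""
    if k <= 0:
        return ()
    return (ctx + (label,))[-k:]

def tick_set(ctx, label):
    """Forget order and repetition: contexts form a set of call-site labels."""
    return ctx | {label}
```

`tick_k` is bounded because there are finitely many label sequences of length at most `k`; `tick_set` is bounded because a program has finitely many call sites.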
| 205 | + |
| 206 | +Abstract addresses with cardinality >1 cannot be refined by constraints. |
| 207 | +Still, when we instantiate a summary with such addresses as arguments and a summary case requires |
| 208 | +a completely disjoint set of values for the corresponding parameter, |
| 209 | +that case is spurious and can be discarded. |
| 210 | + |
| 211 | +Possible Optimizations |
| 212 | +----------------------------------------- |
| 213 | + |
| 214 | +### Abstract garbage collection |
| 215 | + |
| 216 | +This work is very amenable to |
| 217 | +[stack-independent abstract GC](https://kimball.germane.net/germane-2019-stack-liberated.pdf) |
| 218 | +and reference counting to support |
| 219 | +strong updates of transparent addresses. |
| 220 | +The GC is done locally per function, over a relatively small store-delta and root set. |
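
A per-function collection over a store delta amounts to a reachability pass from the root set. The sketch below assumes that values are themselves addresses and that the caller supplies the roots; what counts as a root (for instance the returned value) is left open here.

```python
def collect(delta, roots):
    """Keep only the store-delta entries reachable from `roots`.

    delta: address -> set of addresses it may point to (simplified values)
    roots: iterable of addresses that must stay live
    """
    live = set()
    work = list(roots)
    while work:
        addr = work.pop()
        if addr in live:
            continue
        live.add(addr)
        work.extend(delta.get(addr, ()))
    return {addr: values for addr, values in delta.items() if addr in live}
```

Because the pass only walks one function's delta, its cost depends on that delta's size rather than on the size of the whole store.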
| 221 | + |
| 222 | +### Lock-free parallelism |