
Commit 809cae7

First commit
1 parent 05d21dd commit 809cae7

File tree

3 files changed: +360 -0 lines changed

.gitignore

+1
*~

README.md

+222

Polymorphic Summaries for Abstract Definitional Interpreters
=========================================

[AAM](https://dl.acm.org/doi/10.1145/1863543.1863553)- and
[ADI](https://dl.acm.org/doi/10.1145/3110256)-based
analyses' faithfulness to operational semantics brings substantial benefits
(e.g. easy extension to expressive features with straightforward soundness proofs),
but also seems to inevitably carry over one downside:
each component is re-analyzed at each use, which either
enjoys precision when it works well, or duplicates imprecision and multiplies overhead
when it doesn't.
The lack of function summaries is a key hindrance in scaling these analyses to large codebases.

This **in progress** Redex model describes a re-formulation of ADI to produce
*polymorphic summaries*,
enabling a scalable and precise analysis of incomplete higher-order stateful programs
*using only a first-order language for summaries*.
Summaries for smaller units (e.g. functions, modules, etc.) can be used as-is when analyzing
arbitrary code without compromising soundness,
but can also be "linked" together for a more precise summary,
at a cost less than a from-scratch analysis.
What's more, ADI naturally yields *memoized top-down* (as opposed to bottom-up) summarization,
avoiding the cost of analyzing the entire codebase when the entry point only reaches a
small fraction of the code.

High-level Ideas
-----------------------------------------

Initial idea: we want to produce summaries by running each `λ` only once[^once],
*symbolically*, agnostic to any caller.
The *polymorphic summary* contains the result, effect (e.g. "store delta"),
and path-condition, all in terms of the symbolic arguments.
Then at each call site, we instantiate the summary by substituting the actual arguments,
potentially eliminating spurious paths.

[^once]: Unless there's a circular dependency, which is taken care of by ADI's cache-fixing loop.

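A tiny, hypothetical sketch of this idea in plain Racket (the names `case-of` and `instantiate` and the example `λ` are made up for illustration; the Redex model below represents summaries differently and also records a store delta):

```racket
#lang racket
;; Hypothetical summary for (λ (x) (if (null? x) 0 1)), computed once against the
;; symbolic argument `x`.  Each case pairs a result with its path-condition.
(struct case-of (result path-condition) #:transparent)

(define summary-of-λ
  (list (case-of 0 '(null? x))
        (case-of 1 '(not (null? x)))))

;; Substitute the actual argument for the symbolic one inside a path-condition.
(define (subst-in sexp subst)
  (cond [(assq sexp subst) => cdr]
        [(pair? sexp) (cons (subst-in (car sexp) subst) (subst-in (cdr sexp) subst))]
        [else sexp]))

;; Instantiate a summary at a call site: substitute the arguments and discard cases
;; whose path-condition is refuted by facts the call site already knows.
(define (instantiate summary known-facts subst)
  (for/list ([c (in-list summary)]
             #:unless (member `(not ,(subst-in (case-of-path-condition c) subst))
                              known-facts))
    (case-of (case-of-result c)
             (subst-in (case-of-path-condition c) subst))))

;; At a call site passing `ys`, where (not (null? ys)) is already known,
;; only the second case survives.
(instantiate summary-of-λ '((not (null? ys))) '((x . ys)))
;; ⇒ (list (case-of 1 '(not (null? ys))))
```
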
The obvious problem with this idea is: *What should we do with symbolic functions?*

While it's tempting to try to come up with a language of "polymorphic, higher-order summaries",
similar to a type-and-effect system, that is somehow decidable without resorting to trivial
imprecision, that's a very difficult route, especially for untyped languages and idioms.
Even if we managed to do it for a fixed set of
language features, the language of summaries would likely be far removed from the
operational semantics, making soundness proofs and extensions to new features challenging.

Another choice, which would be sound but pessimally imprecise, is to use
[havoc](https://dl.acm.org/doi/10.1145/3158139)
to over-approximate all interactions with unknown code whenever in doubt.
For contract verification, the imprecision can be
mitigated with more precise contracts at boundaries (e.g. parametric contracts),
but that isn't viable for static analysis in general.

To combine *almost* the simplicity of mirroring the operational semantics,
the scalability of function summaries,
and the sound modularity from `havoc`, we make the following two observations:

1. Whole higher-order programs are essentially first-order programs
   (e.g. by defunctionalization).
   With the closed-world assumption, *higher-order functions only need first-order summaries*
   (see the sketch after this list).

2. Incomplete programs are essentially whole programs, with unknown code filled in as `havoc`,
   with its own first-order summary.

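A toy illustration of observation 1 (hypothetical code, not part of the model): defunctionalizing a small whole program replaces closures with tagged first-order data and turns application into a case-split over all known `λ`-tags.

```racket
#lang racket
;; Higher-order original: two different λs flow to the same call site.
(define (twice f x) (f (f x)))
(twice (λ (n) (+ n 1)) 5)   ; ⇒ 7
(twice (λ (n) (* n 2)) 5)   ; ⇒ 20

;; Defunctionalized version: closures become tagged first-order data, and
;; application dispatches on the λ-tag.
(struct add1-clo () #:transparent)   ; tag for (λ (n) (+ n 1))
(struct dbl-clo  () #:transparent)   ; tag for (λ (n) (* n 2))

(define (apply-clo clo arg)
  (match clo
    [(add1-clo) (+ arg 1)]
    [(dbl-clo)  (* arg 2)]))

(define (twice/defun clo x) (apply-clo clo (apply-clo clo x)))
(twice/defun (add1-clo) 5)   ; ⇒ 7
(twice/defun (dbl-clo)  5)   ; ⇒ 20
```
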
### "Polymorphic", just not "parametric"
65+
66+
This work is inspired by the line of work on
[polymorphic method summaries for whole OO programs](https://web.eecs.umich.edu/~xwangsd/pubs/aplas15.pdf),
where it's valid to make the closed-world assumption that all implementations of each class
are known. At each virtual method invocation, the analysis simply gathers the summaries
of *all* known implementations,
then uses information at the call site to discard spurious cases
or propagate constraints on the receiver's dynamic class tag.
The obtained "polymorphic" summaries are only valid in the specific whole program they're in,
and would be invalidated by additional method implementations.

For functional programs, this is analogous to analyzing their defunctionalized versions[^defunct].
Specifically, each `(λ x e)` form has a first-order summary of `e`'s result and effect
(as a *store delta*), with respect to a *symbolic environment* ranging over `e`'s free variables.
Each application's results and effects are gathered from the summaries of all known `λ`s,
each guarded by a constraint over the target closure's "`λ`-tag".
As long as the program is closed, the summaries are sound.
Summary instantiation is a straightforward substitution of the symbolic environment with
the actual one from the target closure, with no further call to evaluation;
it is sensitive to the closure's `λ`-tag, its environment, and the argument.

[^defunct]: Although we don't need to explicitly defunctionalize the program.

Note that there is no such thing as "applying a symbolic function then obtaining its symbolic result
and effect" in this system, thanks to the "defunctionalized view".
When "in doubt", we case-split over all known `λ`s, then use their respective first-order summaries
guarded by constraints over the `λ`-tag (see the sketch below).

Although it sounds expensive to consider all `λ`s in many cases, remember that we are
pumping memoized summaries around, not re-running analyses.

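A minimal sketch of this case-split (the summary contents and s-expression guards below are made up for illustration; the model's representation is different):

```racket
#lang racket
;; Per-λ first-order summaries of results, keyed by λ-tag.
(define summaries
  (hash 'λ1 '((result 0 if (null? x)))
        'λ2 '((result 1 if #t))))

;; Applying a symbolic closure `f` gathers every known λ's cases, each guarded
;; by a constraint on f's λ-tag.
(define (apply-symbolic f)
  (for*/list ([(tag cases) (in-hash summaries)]
              [c (in-list cases)])
    `(,@c and (tag-of ,f = ,tag))))

(apply-symbolic 'f)
;; ⇒ both cases, each ending in (tag-of f = λ1) / (tag-of f = λ2) (order unspecified)
```
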
### Removing the closed-world assumption to enable modularity

Whole-program analyses are useful, and separating the *scalability* goal from the stronger
*modularity* goal is neat, but we sometimes need modularity, such as verifying a component
once and for all against arbitrary uses.

Luckily, there's
[previous work on soundly over-approximating unknown code in the presence of
higher-order functions and mutable state](https://dl.acm.org/doi/10.1145/3158139)
using an operation called `havoc`.
To generalize function summaries to incomplete programs, we simply add `havoc`'s summaries,
for use (1) when applying a closure whose `λ`-tag comes from unknown code,
and (2) when a `λ` from one module flows to another module that doesn't already "know" it.
As a result, we get *summaries that are sound even in an open world,
and also precise for known code*.

When analyzing code that uses both modules `m1` and `m2`, we could either re-use their
separately produced summaries as-is, or "link"[^link] those summaries by
re-running ADI's cache-fixing loop over the merged summaries, producing a more precise
summary of both `m1` and `m2`.

[^link]: The term "linking" only conveys a loose analogy, unfortunately:
we still need `m1` and `m2`'s source code to re-run ADI. But the process is cheaper
than re-running ADI from scratch.

### ADI for memoized top-down instead of bottom-up

Most work on function summaries presents the process as eager and bottom-up. As is well understood from
dynamic programming, one drawback compared to (memoized) top-down is the potentially needless
computation when only a fraction of the codebase is reachable from the entry point.

ADI naturally carries out the semantics top-down and memoized, starting from the entry
point. We track the `λ`s whose closures have been created, and only consider summaries for those,
on demand, at applications. This potentially requires more iterations to reach a fixed point,
but can save a lot of work when the reachable code is small compared to the entire codebase.

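A minimal sketch of such a memoized, top-down driver in plain Racket (the "program" below is a stand-in, and ADI's cache-fixing loop for circular dependencies is omitted):

```racket
#lang racket
(define memo (make-hash))   ; λ-tag ↦ summary
(define analyzed '())       ; order in which λs got summarized

;; Summarize a λ only the first time a closure for it is applied.
(define (summarize! tag thunk)
  (hash-ref! memo tag
             (λ ()
               (set! analyzed (cons tag analyzed))
               (thunk))))

;; Pretend program: the entry point applies λa, λa applies λb; λc is unreachable.
(define (analyze-app tag)
  (match tag
    ['λa (summarize! 'λa (λ () (cons 'uses (analyze-app 'λb))))]
    ['λb (summarize! 'λb (λ () 'base-summary))]
    ['λc (summarize! 'λc (λ () 'never-computed))]))

(analyze-app 'λa)
(reverse analyzed)   ; ⇒ '(λa λb); λc is never summarized
```
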
The Redex Model (NOT FINISHED!!)
-----------------------------------------

The Redex model is for a *modified concrete semantics that summarizes each function as
first-order values and store-deltas*.
This semantics should compute the same result as the standard operational semantics[^proof],
and can be systematically finitized (à la AAM/ADI) to obtain a static analysis.
The model is minimal and omits `havoc`.
The language is the λ-calculus with `set!`.

[^proof]: TODO: needs proof. But this only needs to be done once; all abstractions obtained by finitization and non-determinism should then be sound.

### *Symbolic* vs *transparent*

There's a distinction between *symbolic* and *transparent* addresses `α` and values `v`.
When analyzing a function:

* A *symbolic address* stands for one that is passed in from an arbitrary caller.
  All values and addresses initially reachable from it are also symbolic.
  No aliasing information is assumed.
* A *transparent address* is one allocated either directly within the function's body
  or indirectly by its callees. Aliasing among transparent addresses is always
  soundly tracked and approximated. In particular, transparent addresses cannot be aliased
  by symbolic addresses, by construction.

Unlike standard ADI, this system:

* only memoizes function bodies (as opposed to all sub-expressions)
* memoizes symbolic results to be instantiated at each call site
  (as opposed to already context-sensitive results to be returned as-is).

In this formalism, application is the only place where we allocate a transparent address.
Given the callee's result, we specialize it by (see the sketch below):

* substituting its symbolic addresses with the caller's ones (with ``)
* consistently renaming its transparent addresses by attaching the caller's context ``
  (with `tick`; the name is inspired by AAM's work, but its use is in contrast to classic AAM
  and ADI, which propagate contexts from callers to callees)
* doing the obvious constraint propagation, path pruning, and effect joining (with `⊔σ`).

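A small sketch of the address part of this specialization, using s-expression stand-ins for the model's addresses and restricting symbolic addresses to a single variable for brevity:

```racket
#lang racket
;; (↝ x)       : symbolic address for variable x (single-variable case only)
;; (x @ ℓ ...) : transparent address with its list of call-site labels
(define (specialize-address ℓ ρ a)
  (match a
    [`(↝ ,x)         (hash-ref ρ x a)]   ; symbolic: replace with the caller's address
    [`(,x @ ,ℓs ...) `(,x @ ,ℓ ,@ℓs)]    ; transparent: attach the caller's context ℓ
    [_ a]))

;; At call site 7, where the caller binds y to its own transparent address (y @ 3):
(define caller-ρ (hash 'y '(y @ 3)))
(specialize-address 7 caller-ρ '(↝ y))       ; ⇒ '(y @ 3)
(specialize-address 7 caller-ρ '(tmp @ 5))   ; ⇒ '(tmp @ 7 5)
```
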
### Path-condition representation

Instead of representing conditions per-path, we guard values (and therefore chunks
of effect) with smaller chunks of the path-condition, and take their conjunction at appropriate
places. This representation prevents eagerly splitting and duplicating many constructs.

For example, `{[α₁ ↦ {v₁ if φ₁, v₂ if φ₂}] [α₂ ↦ {v₃ if φ₃, v₄ if φ₄}]}` compactly represents
many paths: `{[α₁ ↦ v₁, α₂ ↦ v₃] if φ₁ ∧ ¬φ₂ ∧ φ₃}`, `{[α₁ ↦ v₂, α₂ ↦ v₃] if ¬φ₁ ∧ φ₂ ∧ φ₃}`,
`{[α₁ ↦ {v₁, v₂}, α₂ ↦ v₃] if φ₁ ∧ φ₂ ∧ φ₃}`, etc.

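A toy reading of the example above in plain Racket, treating the guards as opaque symbols that are simply assumed true or false on a given path:

```racket
#lang racket
(define guarded-store
  (hash 'α₁ '((v₁ . φ₁) (v₂ . φ₂))
        'α₂ '((v₃ . φ₃) (v₄ . φ₄))))

;; Keep, at each address, exactly the values whose guards hold on this path.
(define (store-on-path truths)   ; truths : set of guards assumed true
  (for/hash ([(α cases) (in-hash guarded-store)])
    (values α (for/list ([c (in-list cases)]
                         #:when (set-member? truths (cdr c)))
                (car c)))))

(store-on-path (set 'φ₁ 'φ₃))       ; ⇒ α₁ ↦ (v₁), α₂ ↦ (v₃)
(store-on-path (set 'φ₁ 'φ₂ 'φ₃))   ; ⇒ α₁ ↦ (v₁ v₂), α₂ ↦ (v₃)
```
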
The language of constraints in `φ` talks about (1) `λ`-tags of values and
(2) referential equality.

### Aliasing

Two symbolic closures (with the same `λ`-tag) can be aliases of each other,
which means that if we execute one and get a `set!` effect at symbolic address `ρ₁[x]`,
it may or may not be reflected at symbolic address `ρ₂[x]` when we run the other closure for its result.
For this reason, whenever we get back an effect `ρ₁[x] ↦ {v₁}`, we also include the conditional
effect `ρ₂[x] ↦ {v₁ if ρ₁ = ρ₂}`.

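A minimal sketch of how a single `set!` through `ρ₁` yields one unconditional effect plus a conditional effect for every other environment that may alias it (s-expression stand-ins, not the model's representation):

```racket
#lang racket
(define (set!-effects x new-value writer-ρ other-ρs)
  (cons `((addr ,writer-ρ ,x) ↦ ,new-value if #t)
        (for/list ([ρ₂ (in-list other-ρs)])
          `((addr ,ρ₂ ,x) ↦ ,new-value if (= ,writer-ρ ,ρ₂)))))

(set!-effects 'x 'v₁ 'ρ₁ '(ρ₂ ρ₃))
;; ⇒ '(((addr ρ₁ x) ↦ v₁ if #t)
;;     ((addr ρ₂ x) ↦ v₁ if (= ρ₁ ρ₂))
;;     ((addr ρ₃ x) ↦ v₁ if (= ρ₁ ρ₃)))
```
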
### Unbounded components in the concrete summarizing semantics, and their abstractions

Two components can grow unboundedly:

1. Transparent addresses with accumulated contexts: callers can keep attaching new contexts
   to a callee's returned transparent addresses to distinguish results from different call sites.
2. Symbolic addresses growing access paths (e.g. `ρ[x]->y->x->y->...`).

We apply AAM/ADI-style finitization (see the sketch after this list):

1. For transparent addresses, we could either use a fixed `k`,
   or approximate the list of contexts with a set (which is finite in any program).
   Note that we can't multiply analysis contexts no matter which choice we make;
   we only risk having poorly summarized results.
2. For symbolic addresses,
   we can store-allocate the access paths to effectively summarize unbounded chains
   using a regular language that reflects the program's structure.

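A small sketch of the two choices for transparent addresses, on s-expression stand-ins of the form `(x @ ℓ ...)`:

```racket
#lang racket
;; Choice 1: keep only the k most recent call-site labels.
(define (finitize/k k addr)
  (match addr
    [`(,x @ ,ℓs ...) `(,x @ ,@(take ℓs (min k (length ℓs))))]))

;; Choice 2: collapse the label list into a set (finite for any fixed program).
(define (finitize/set addr)
  (match addr
    [`(,x @ ,ℓs ...) `(,x @ ,(list->set ℓs))]))

(finitize/k 1 '(tmp @ 7 3 5))     ; ⇒ '(tmp @ 7)
(finitize/set '(tmp @ 7 3 5 7))   ; the label list collapsed to the set {3 5 7}
```
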
While abstract addresses with cardinality >1 cannot be refined in constraints,
if we instantiate a summary with such an address as an argument and a summary case specifies
a completely disjoint set of values for the corresponding parameter,
that case is still spurious and can be discarded.

Possible Optimizations
-----------------------------------------

### Abstract garbage collection

This work is very amenable to
[stack-independent abstract GC](https://kimball.germane.net/germane-2019-stack-liberated.pdf)
and reference counting to support
strong updates of transparent addresses.
The GC is done locally per function, over a relatively small store-delta and root set.

### Lock-free parallelism

src/redex/lang.rkt

+137
#lang racket
(provide (all-defined-out))
(require redex)

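;; NOTE: the helper metafunctions used below (lookup-σ, fv, ⊔σ, ⊔v, refine-v,
;; refine-φ, tick-v, tick-σ, reroot-v, reroot-σ, ⧺, ℓ!) and the judgments ⊢/memo
;; and ⊢ᵘ are referenced but not yet defined in this commit.
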
(define-language L
  ;; Syntax
  [#|Expressions       |# e ::= x l (e e) (set! x e)]
  [#|Function literals |# l ::= (λ x e)]
  [#|Variables         |# x y z ::= variable-not-otherwise-mentioned]
  [#|Source Locations  |# ℓ ::= integer]

  ;; Semantics
  [#|Values            |# u ::= #|Transparent|# (Clo l ρ)
                                #|Symbolic   |# (↝ x ...)]
  [#|Abstract Values   |# v ::= {(u if φ) ...}]
  [#|Environments      |# ρ ::= {[x ↦ α] ...}]
  [#|Addresses         |# α ::= #|Transparent|# (x @ ℓ ...)
                                #|Symbolic   |# (↝ x ...)]
  [#|Stores            |# σ ::= {[α ↝ v] ...}]
  [#|Path conditions   |# φ ::= {[u : l] ...}]
  [                       ?φ ::= φ #f]
  [#|Summary Tables    |# Σ ::= {[l ⇒ s] ...}] ; ≃ "memo-table" in original ADI
  [#|Summaries         |# s ::= (v σ) #f])

(define-judgment-form L
  #:contract (⊢ Σ : Σ σ : e ⇓ v σ Σ)
  #:mode     (⊢ I I I I I I I O O O)

  ;; `(↝ x)` is always the symbolic address variable `x` maps to when in scope
  [-----------------------------------------"ev-var"
   (⊢ _ : Σ σ : x ⇓ (lookup-σ σ (↝ x)) σ Σ)]

  [(where {x ...} (fv l))
   (where ρ {[x ↦ (↝ x)] ...})
   -----------------------------------------"ev-lam"
   (⊢ _ : Σ σ : l ⇓ {((Clo l ρ) if {})} σ Σ)]

  [(⊢ Σ_in : Σ_0 σ_0 : e ⇓ v σ_1 Σ_1)
   -----------------------------------------"ev-set"
   (⊢ Σ_in : Σ_0 σ_0 : (set! x e) ⇓ v (⊔σ σ_1 {[(↝ x) ↝ v]}) Σ_1)]

  [(⊢ Σ_in : Σ_0 σ_0 : e_1 ⇓ v_1 σ_1 Σ_1)
   (⊢ Σ_in : Σ_1 σ_1 : e_2 ⇓ v_2 σ_2 Σ_2)
   (where ℓ ,(ℓ! (term (e_1 e_2))))
   (⊢ᵃ* Σ_in : Σ_2 σ_2 ℓ : v_1 v_2 ⇓ v_a σ_a Σ_a)
   -----------------------------------------"ev-app"
   (⊢ Σ_in : Σ_0 σ_0 : (e_1 e_2) ⇓ v_a σ_a Σ_a)]
  )

(define-judgment-form L
  #:contract (⊢ᵃ* Σ : Σ σ ℓ : v v ⇓ v σ Σ)
  #:mode     (⊢ᵃ* I I I I I I I I I O O O)

  [-----------------------------------------"app-none"
   (⊢ᵃ* _ : Σ σ _ : {} _ ⇓ {} σ Σ)]

  [(where v_x* (refine-v v_x φ))
   (⊢ᵃ  Σ_in : Σ_0 σ_0 ℓ : u v_x* ⇓ v_1 σ_1 Σ_1)
   (⊢ᵃ* Σ_in : Σ_1 σ_0 ℓ : {any_i ...} v_x ⇓ v_i σ_i Σ_i)
   -----------------------------------------"app-some"
   (⊢ᵃ* Σ_in : Σ_0 σ_0 ℓ : {(u if φ) any_i ...} v_x ⇓ (⊔v v_1 v_i) (⊔σ σ_1 σ_i) Σ_i)])

(define-judgment-form L
  #:contract (⊢ᵃ Σ : Σ σ ℓ : u v ⇓ v σ Σ)
  #:mode     (⊢ᵃ I I I I I I I I I O O O)

  [(⊢/memo Σ_in : Σ_0 : l ⇓ v_ee σ_ee Σ_1)
   (where α_x (x @))
   (where σ_er* (⊔σ σ_er {[α_x ↝ v]}))
   (where ρ_* (⧺ ρ {[x ↦ α_x]}))
   (⊢ᵛ σ_er* ρ_* : (tick-v ℓ v_ee) ⟹ v_a)
   (⊢ˢ σ_er* ρ_* : (tick-σ ℓ σ_ee) ⟹ σ_a)
   -----------------------------------------"app-lam"
   (⊢ᵃ Σ_in : Σ_0 σ_er ℓ : (Clo (name l (λ x e)) ρ) v ⇓ v_a (⊔σ σ_er* σ_a) Σ_1)]

  [(where {_ ... [(name l (λ x e)) ⇒ _] _ ...} Σ_0)
   (where v_x (refine-v v {[(↝ z ...) : l]}))
   (⊢/memo Σ_in : Σ_0 : l ⇓ v_ee σ_ee Σ_1)
   (where α_x (x @))
   (where σ_er* (⊔σ σ_er {[α_x ↝ v_x]}))
   (where {y ...} (fv l))
   (where ρ_* {[y ↦ (↝ z ... y)] ... [x ↦ α_x]})
   (⊢ᵛ σ_er* ρ_* : (reroot-v (↝ z ...) (tick-v ℓ v_ee)) ⟹ v_a)
   (⊢ˢ σ_er* ρ_* : (reroot-σ (↝ z ...) (tick-σ ℓ σ_ee)) ⟹ σ_a)
   -----------------------------------------"app-sym"
   (⊢ᵃ Σ_in : Σ_0 σ_er ℓ : (↝ z ...) v ⇓ v_a (⊔σ σ_er* σ_a) Σ_1)]
  )

(define-judgment-form L
  #:contract (⊢ᵛ σ ρ : v ⟹ v)
  #:mode     (⊢ᵛ I I I I I O)

  [-----------------------------------------"btm"
   (⊢ᵛ _ _ : {} ⟹ {})]

  [(⊢ᵠ σ ρ : φ ⟹ φ_1)
   (⊢ᵘ σ ρ : u ⟹ u_1)
   (⊢ᵛ σ ρ : {any_i ...} ⟹ v_i)
   -----------------------------------------"join"
   (⊢ᵛ σ ρ : {(u if φ) any_i ...} ⟹ (⊔v {(u_1 if φ_1)} v_i))]

  [(⊢ᵠ σ ρ : φ ⟹ #f)
   (⊢ᵘ σ ρ : u ⟹ u_1)
   (⊢ᵛ σ ρ : {any_i ...} ⟹ v_i)
   -----------------------------------------"drop"
   (⊢ᵛ σ ρ : {(u if φ) any_i ...} ⟹ v_i)]
  )

(define-judgment-form L
  #:contract (⊢ᵠ σ ρ : φ ⟹ ?φ)
  #:mode     (⊢ᵠ I I I I I O)
  [(⊢ᵘ σ ρ : u ⟹ u_1) ...
   -----------------------------------------
   (⊢ᵠ σ ρ : {[u : l] ...} ⟹ (refine-φ {[u_1 : l]} ...))])

(define-judgment-form L
  #:contract (⊢ˢ σ ρ : σ ⟹ σ)
  #:mode     (⊢ˢ I I I I I O)
  [(⊢ᵅ σ ρ : α ⟹ α_1) ...
   (⊢ᵛ σ ρ : v ⟹ v_1) ...
   -----------------------------------------
   (⊢ˢ σ ρ : {[α ↝ v] ...} ⟹ (⊔σ {[α_1 ↝ v_1]} ...))])

(define-judgment-form L
  #:contract (⊢ᵅ σ ρ : α ⟹ α)
  #:mode     (⊢ᵅ I I I I I O)

  [-----------------------------------------"free"
   (⊢ᵅ _ _ : (name α (x @ _ ...)) ⟹ α)]

  [-----------------------------------------"bound"
   (⊢ᵅ σ {_ ... [x ↦ α] _ ...} : (↝ x) ⟹ α)]

  [
   -----------------------------------------"chase"
   (⊢ᵅ σ ρ : (↝ x y ...) ⟹ )]
  )
