[ty] Add constraint set implementation #19997
Conversation
I'm excited to see how much slower this is than
Diagnostic diff on typing conformance tests: No changes detected when running ty on typing conformance tests ✅
CodSpeed WallTime Performance Report: Merging #19997 will not alter performance.
This is ready for review! #20093 is the real proof that this representation works well. In some ways, this PR is just a setup for that, even though we're introducing a pretty complex new data structure here.
AlexWaygood left a comment:
This looks cool! I haven't done a deep review of the code for correctness -- this is mainly a docs review :-)
Should this PR still have "WIP" in its title? 😄
Nope! Removed
* main:
  - Fix mdtest ignore python code blocks (#20139)
  - [ty] add support for cyclic legacy generic protocols (#20125)
  - [ty] add cycle detection for find_legacy_typevars (#20124)
  - Use new diff rendering format in tests (#20101)
  - [ty] Fix 'too many cycle iterations' for unions of literals (#20137)
  - [ty] No boundness analysis for implicit instance attributes (#20128)
  - Bump 0.12.11 (#20136)
  - [ty] Benchmarks for problematic implicit instance attributes cases (#20133)
  - [`pyflakes`] Fix `allowed-unused-imports` matching for top-level modules (`F401`) (#20115)
  - Move GitLab output rendering to `ruff_db` (#20117)
  - [ty] Evaluate reachability of non-definitely-bound to Ambiguous (#19579)
  - [ty] Introduce a representation for the top/bottom materialization of an invariant generic (#20076)
  - [`flake8-async`] Implement `blocking-http-call-httpx` (`ASYNC212`) (#20091)
  - [ty] print diagnostics with fully qualified name to disambiguate some cases (#19850)
  - [`ruff`] Preserve relative whitespace in multi-line expressions (`RUF033`) (#19647)
This is really clean and elegant!
A few thoughts, none of them blocking this PR:
- There will be a lot of `has_relation_to` checks where we collect constraints but never evaluate them for anything other than always-satisfied or never-satisfied. Will there be opportunities to improve performance on those checks if we know up-front that all we care about is always, or all we care about is never? It seems like it could potentially allow us to short-circuit a lot of work. (Something to explore in a future PR, not now.)
- My recursion spidey-sense tingles a bit about the fact that we use a `ConstraintSet` in evaluating `has_relation_to`, and building a `ConstraintSet` involves a lot of `is_subtype_of` checks on upper and lower bound types. Is there potential for stack overflow here? Do we need anything additional to prevent that? Can we get into a situation where evaluating a subtype relation causes us to build a constraint set that requires evaluating the original subtype relation? If so, our `CycleDetector` on `has_relation_to` wouldn't help, because it would be separate `is_subtype_of` checks.
- Possibly related to (2): the spec says typevar bounds/constraints cannot be generic, but there's been recent discussion of lifting that requirement, and it sounds like Pyrefly will experiment with that. It seems to me that we're well-positioned for that as well (you'd just end up adding constraints on the nested typevar, too), but maybe something to consider.
```rust
// If two clauses cancel out to 0, that does NOT cause the entire set to become
// 0. We need to keep whatever clauses have already been added to the result,
// and also need to copy over any later clauses that we hadn't processed yet.
self.clauses.extend(existing_clauses);
return;
```
Correctness here depends on the invariant that a single new clause can only ever simplify-to-never with one existing clause (i.e. it can't cancel out two different existing clauses.) How do we know that to be the case here? Below with the Simplified case, in contrast, we explicitly handle the possibility that the new clause may simplify with a later clause.
Any existing clauses must already be simplified relative to each other. So I think that for a new clause to cancel out more than one existing clause, it would have to do it in multiple steps, in a confluent way. So the new clause would "partially" simplify against the first existing clause that we encounter (i.e. simplify a bit but not all the way to 0). (That would trigger the Simplified branch below, where we carry the simplified result over to check against later existing clauses.) Then that partially simplified clause would simplify the "rest of the way" to 0 when we encounter the second (relevant) existing clause. And the "confluent" part means that it would need to happen regardless of the order that the two existing clauses appear in the original result.
I have not done a proof that ☝️ holds, but that's my intuition for why it should™ work.
```rust
// # `1 ∪ (T ≤ int)`
// # simplifies via saturation to
// # `T ≤ int`
// x: A[U] | A[V]
```
I could be missing something here: I can see how we can abstractly say that these constraints apply here, but concretely I don't think this code would ever result in us creating a `ConstraintSet` at all? There is no assignment here (where we'd create a `ConstraintSet` ephemerally in `has_relation_to_impl`, just in order to check if it's `always_satisfiable`), nor is there a call to a generic function or constructor, where we'd create a `ConstraintSet` across multiple assignability checks (for each argument) and then solve it in order to generate a specialization.
I think to the extent that there is value in having Python examples (I'm not convinced that it's useful in code at this level of abstraction), it should ideally be examples where we would actually have to exercise the code in question in order to arrive at a correct type-checking answer in the Python example. I'm not quite seeing that in these examples; they are more like re-stating the set theory with a different syntax.
That said, I also don't think we should spend more time right now on improving these examples, so I'm fine leaving them as-is; this is more of a thought for future.
```rust
if self.subsumes_via_intersection(db, &other) {
    return Simplifiable::Simplified(other);
}
if other.subsumes_via_intersection(db, &self) {
    return Simplifiable::Simplified(self);
}
```
We do this via two merged iterations, but I think it can be easily done with a single iteration and a tri-valued return?
Maybe it doesn't matter; it depends on how hot this ends up being in practice, and how many multi-constraint clauses we see.
Added a TODO to consider this
dcreager left a comment:
> There will be a lot of `has_relation_to` checks where we collect constraints but never evaluate them for anything other than always-satisfied or never-satisfied. Will there be opportunities to improve performance on those checks if we know up-front that all we care about is always, or all we care about is never? It seems like it could potentially allow us to short-circuit a lot of work. (Something to explore in a future PR, not now.)
I don't think this would give correct results. This is related to my comment from last week #19838 (comment), and you can see it in the draft of #20093. In that PR I've moved around the non-inferrable typevar match arms in `has_relation_to`, because we no longer have to be careful about doing some typevar checks before we handle the connectives, and others after. We can rely on how we combine the constraints from the recursive calls to let partially satisfiable recursive constraint sets either (a) "build up" towards 1, or (b) "cancel out" towards 0. Doing so requires having the full constraint sets available, so that we can look at their structure to see what they do when unioned or intersected together. Doing that on `bool` loses that detail, leading to wrong answers.
> My recursion spidey-sense tingles a bit about the fact that we use a `ConstraintSet` in evaluating `has_relation_to`, and building a `ConstraintSet` involves a lot of `is_subtype_of` checks on upper and lower bound types. Is there potential for stack overflow here? Do we need anything additional to prevent that? Can we get into a situation where evaluating a subtype relation causes us to build a constraint set that requires evaluating the original subtype relation? If so, our `CycleDetector` on `has_relation_to` wouldn't help, because it would be separate `is_subtype_of` checks.
If the bounds of a constraint don't contain any typevars (a "concrete" type), then I think we're okay, since calculating subtyping of two concrete types can only produce true, false, and combinations of those. (If there are no typevars in the type, then there's nothing to create an `AtomicConstraint` for.) And so we never hit any of the new logic for combining and simplifying constraints.
If there are bounds that do contain typevars, we do have to worry about this — and the way POPL15 etc solve this is by introducing an ordering on typevars, and saying that typevar bounds can only reference other typevars that are smaller according to that ordering. That ensures that you don't get cycles in the "bounds graph". I figure we'll just use Salsa IDs as our ordering when we get to that part.
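A tiny, hypothetical sketch of that ordering idea (stand-in types and names; the real ordering would presumably use Salsa IDs as described above):

```rust
// Hypothetical sketch: a typevar's bound may only mention typevars that are
// strictly smaller in some total order, which keeps the "bounds graph"
// acyclic by construction. `TypeVarId` stands in for a Salsa ID.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
struct TypeVarId(u32);

struct BoundedTypeVar {
    id: TypeVarId,
    /// Typevars mentioned in this typevar's upper/lower bounds.
    typevars_in_bounds: Vec<TypeVarId>,
}

/// Returns true if every typevar mentioned in the bounds is smaller than the
/// typevar being bounded, ruling out cycles in the bounds graph.
fn bounds_are_well_ordered(tv: &BoundedTypeVar) -> bool {
    tv.typevars_in_bounds.iter().all(|other| *other < tv.id)
}

fn main() {
    let t = BoundedTypeVar {
        id: TypeVarId(2),
        typevars_in_bounds: vec![TypeVarId(1)],
    };
    assert!(bounds_are_well_ordered(&t));
}
```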
> Possibly related to (2): the spec says typevar bounds/constraints cannot be generic, but there's been recent discussion of lifting that requirement, and it sounds like Pyrefly will experiment with that. It seems to me that we're well-positioned for that as well (you'd just end up adding constraints on the nested typevar, too), but maybe something to consider.
I think we will already have to support typevars that have constraints involving other typevars, to handle things like calling a generic function (and inferring its specialization) from inside another (such that the constraints of the calling function are needed to figure out the valid specializations of the called function). So at that point it should be no problem to have typevar bounds mention other typevars, since that would just translate into a constraint that can already contain other typevars. (Modulo the bit above about using an artificial ordering to keep the bounds graph acyclic.)
This PR adds an implementation of constraint sets.
An individual constraint restricts the specialization of a single
typevar to be within a particular lower and upper bound: the typevar can
only specialize to types that are a supertype of the lower bound, and a
subtype of the upper bound. (Note that lower and upper bounds are fully
static; we take the bottom and top materializations of the bounds to
remove any gradual forms if needed.) Either bound can be “closed” (where
the bound is a valid specialization), or “open” (where it is not).
You can then build up more complex constraint sets using union,
intersection, and negation operations. We use a disjunctive normal form
(DNF) representation, just like we do for types: a _constraint set_ is
the union of zero or more _clauses_, each of which is the intersection
of zero or more individual constraints. Note that the constraint set
that contains no clauses is never satisfiable (`⋃ {} = 0`); and the
constraint set that contains a single clause, which contains no
constraints, is always satisfiable (`⋃ {⋂ {}} = 1`).
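To make that shape concrete, here is a minimal, hypothetical sketch of the DNF representation described above. The names and fields are illustrative only; the real implementation uses ty's interned types, tracks open/closed bounds, and simplifies clauses pairwise as they are added.

```rust
// Illustrative sketch of the DNF structure; not the actual ty types.
#[derive(Clone, Debug)]
struct Type(&'static str);

#[derive(Clone, Debug)]
struct TypeVar(&'static str);

/// One atomic constraint: `lower ≤ T ≤ upper` (bounds assumed fully static).
#[derive(Clone, Debug)]
struct Constraint {
    typevar: TypeVar,
    lower: Type,
    upper: Type,
}

/// A clause is the intersection (∧) of atomic constraints.
/// The empty clause is always satisfiable: `⋂ {} = 1`.
#[derive(Clone, Debug)]
struct Clause {
    constraints: Vec<Constraint>,
}

/// A constraint set is the union (∨) of clauses — disjunctive normal form.
/// The empty set is never satisfiable: `⋃ {} = 0`.
#[derive(Clone, Debug)]
struct ConstraintSet {
    clauses: Vec<Clause>,
}

impl ConstraintSet {
    fn unsatisfiable() -> Self {
        ConstraintSet { clauses: vec![] }
    }

    fn always_satisfiable() -> Self {
        ConstraintSet { clauses: vec![Clause { constraints: vec![] }] }
    }

    /// Union: concatenate clauses (the real code also simplifies pairwise).
    fn union(mut self, other: ConstraintSet) -> ConstraintSet {
        self.clauses.extend(other.clauses);
        self
    }
}

fn main() {
    let never = ConstraintSet::unsatisfiable();
    let always = ConstraintSet::always_satisfiable();
    // `0 ∪ 1 = 1`: the result contains the single empty clause.
    let combined = never.union(always);
    println!("{combined:?}");
}
```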
One thing to note is that this PR does not change the logic of the
actual assignability checks, and in particular, we still aren't ever
trying to create an "individual constraint" that constrains a typevar.
Technically we're still operating only on `bool`s, since we only ever
instantiate `C::always_satisfiable` (i.e., `true`) and
`C::unsatisfiable` (i.e., `false`) in the `has_relation_to` methods. So
if you thought that astral-sh#19838 introduced an unnecessarily complex stand-in
for `bool`, well here you go, this one is worse! (But still seemingly
not yielding a performance regression!) The next PR in this series,
astral-sh#20093, is where we will actually create some non-trivial constraint
sets and use them in anger.
That said, the PR does go ahead and update the assignability checks to
use the new `ConstraintSet` type instead of `bool`. That part is fairly
straightforward since we had already updated the assignability checks to
use the `Constraints` trait; we just have to actively choose a different
impl type. (For the `is_whatever` variants, which still return a `bool`,
we have to convert the constraint set, but the explicit
`is_always_satisfiable` calls serve as nice documentation of our
intent.)
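Continuing the toy sketch from above (hypothetical names; not ty's actual trait or method signatures), the `bool`-returning variants can be thought of as wrapping the constraint-returning checks and collapsing the result:

```rust
impl ConstraintSet {
    /// Toy check: a simplified set is always satisfiable if it contains an
    /// empty clause (`⋂ {} = 1`). The real `is_always_satisfiable` is richer.
    fn is_always_satisfiable(&self) -> bool {
        self.clauses.iter().any(|clause| clause.constraints.is_empty())
    }
}

/// Hypothetical wrapper in the spirit of the `is_whatever` variants: run the
/// constraint-returning check, then collapse the resulting set to a `bool`.
fn relation_holds(result: ConstraintSet) -> bool {
    result.is_always_satisfiable()
}
```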
---------
Co-authored-by: Alex Waygood <Alex.Waygood@Gmail.com>
Co-authored-by: Carl Meyer <carl@astral.sh>
… (#20306)

The constraint representation that we added in #19997 was subtly wrong, in that it didn't correctly model that type assignability is a _partial_ order — it's possible for two types to be incomparable, with neither a subtype of the other. That means the negation of a constraint like `T ≤ t` (typevar `T` must be a subtype of `t`) is **_not_** `t < T`, but rather `t < T ∨ T ≁ t` (using ≁ to mean "not comparable to"). That means we need to update our constraint representation to be an enum, so that we can track both _range_ constraints (upper/lower bound on the typevar), and these new _incomparable_ constraints.

Since we need an enum now, that also lets us simplify how we were modeling range constraints. Before, we let the lower/upper bounds be either open (<) or closed (≤). Now, range constraints are always closed, and we add a third kind of constraint for _not equivalent_ (≠). We can translate an open upper bound `T < t` into `T ≤ t ∧ T ≠ t`.

We already had the logic for adding _clauses_ to a _set_ by doing a pairwise simplification. We copy that over to where we add _constraints_ to a _clause_. To calculate the intersection or union of two constraints, the new enum representation makes it easy to break down all of the possibilities into a small number of cases: intersect range with range, intersect range with not-equivalent, etc. I've done the math [here](https://dcreager.net/theory/constraints/) to show that the simplifications for each of these cases is correct.
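A rough, hypothetical sketch of the enum shape that description implies (stand-in types and illustrative names only; not the actual #20306 code):

```rust
// Illustrative sketch of the three constraint kinds described above.
#[derive(Clone, Debug)]
struct Type(&'static str);

#[derive(Clone, Debug)]
struct TypeVar(&'static str);

#[derive(Clone, Debug)]
enum Constraint {
    /// `lower ≤ T ≤ upper`, with both bounds closed.
    Range {
        typevar: TypeVar,
        lower: Type,
        upper: Type,
    },
    /// `T ≠ t`: the typevar must not be equivalent to `t`.
    NotEquivalent { typevar: TypeVar, ty: Type },
    /// `T ≁ t`: the typevar must be incomparable to `t`
    /// (neither a subtype nor a supertype).
    Incomparable { typevar: TypeVar, ty: Type },
}

/// Translate an open upper bound `T < t` into `T ≤ t ∧ T ≠ t`, as described
/// above. Using `Never` as the stand-in "no lower bound".
fn open_upper_bound(typevar: TypeVar, t: Type) -> Vec<Constraint> {
    vec![
        Constraint::Range {
            typevar: typevar.clone(),
            lower: Type("Never"),
            upper: t.clone(),
        },
        Constraint::NotEquivalent { typevar, ty: t },
    ]
}

fn main() {
    let constraints = open_upper_bound(TypeVar("T"), Type("int"));
    println!("{constraints:?}");
}
```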