[ty] Use distributed versions of AND and OR on constraint sets#22614
Conversation
Typing conformance results: No changes detected ✅
Force-pushed from 2fc3eac to 13ad151
> /// In particular, we use the IDs that salsa assigns to each constraint as it is created. This
> /// tends to ensure that constraints that are close to each other in the source are also close
> /// to each other in the BDD structure.
While we're here, I'm updating the BDD variable ordering to be less clever. For the pathological example from #21902 this has a huge benefit.
Should we do this in a separate PR so that we better understand where the performance improvements are coming from?
Done: #22777
(I have not addressed the other comments below yet; did this first to see what the performance looks like before considering a fallback for smaller vecs/etc)
> └─₀ <5> (U = str) 1/4
> ┡━₁ <2> SHARED
I'm also updating the tree display to make it more obvious where we're sharing tree structure.
CodSpeed Performance Report: Merging this PR will not alter performance.
From the summary, I expect this to improve performance and reduce memory usage. Neither seems to be the case. Do you have an understanding of where the performance regression comes from? Could we keep using the "old approach" when not dealing with many overloads?
> }
> }
> result
> let inputs: Vec<_> = self.map(|element| f(element).node).collect();
Could it be that the collect calls are expensive? Given that distributed_or and distributed_and are very small methods, would it make sense to pass f through and apply the mapping in distributed_or/and?

Another alternative is to use a SmallVec instead. But I wonder if part of the perf regression simply comes from writing all the constraints to a vec.
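For concreteness, here is a minimal sketch of that suggestion (all names and signatures are hypothetical, not ty's actual API): apply the mapping lazily inside the combinator so that no intermediate `Vec` is built.

```rust
/// Hypothetical sketch: pass `f` through and map lazily instead of
/// collecting the mapped nodes into a `Vec` first.
fn when_any<T, N>(
    elements: impl Iterator<Item = T>,
    f: impl Fn(T) -> N,
    or: impl Fn(N, N) -> N,
) -> Option<N> {
    // `map` is lazy: each element is mapped only when the combining loop
    // pulls it, so no intermediate allocation happens. A plain linear
    // `reduce` stands in here for the real distributed OR.
    elements.map(f).reduce(or)
}
```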
I've pushed up a new version that doesn't collect into a vec first. I want to see how that affects the perf numbers; if it works well I plan to add some better documentation comments describing how it works
This seems to work well! Pushed up some documentation of the approach.
Me too! Maybe the collecting is the culprit, as you suggest? Or maybe because we're not short-circuiting anymore? I have an idea that might help with both.

Wow, nice job getting this from a 5x perf regression to a 4% perf improvement!!
Force-pushed from 02ef331 to 7cf9fe2
MichaReiser left a comment:
Nice.
It might make sense to specialize distribute_or and distribute_and (or when_any) for &[T] and ExactSizeIterator, as we see a perf regression on many projects (albeit a small one), unless the perf regression is related to the ordering change. I suggest splitting that change into its own PR so that we have a better understanding of where the regression is coming from.
> }
> }
> result
> let node = Node::distributed_or(db, self.map(|element| f(element).node));
This PR does improve the performance by a fair bit, but it also regresses performance by about 1-2% on most other projects.

It would be lovely if we could use specialization to specialize when_any and when_all for ExactSizeIterator so that we could use the old implementation if there are only very few items. But that's unlikely to be an option any time soon unless we migrate to nightly Rust.
I went through some when_any usages and:

- We could implement `when_any` for `&[T]` and `&FxOrderSet`
- We could add a `when_any_exact` method (or rename `when_any` to `when_any_iter` to advertise the `ExactSizeIterator` version)
An ExactSizeIterator is required to return the exact size from size_hint, too, so I added a fallback that checks if the max size hint is <= 4, and uses the old implementation if so.
(That way I didn't have to worry about specialization or adding a new method for exact-sized things)
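For reference, a minimal sketch of that fallback (the names, the threshold handling, and the balanced stand-in are all hypothetical, not the actual ty code):

```rust
/// Sketch of the size-hint fallback. `ExactSizeIterator` implementations must
/// report their exact length through `size_hint`, so checking the upper bound
/// covers exact-sized iterators as well as anything else that advertises a
/// small maximum size.
fn distributed_or<T>(iter: impl Iterator<Item = T>, or: &impl Fn(T, T) -> T) -> Option<T> {
    match iter.size_hint() {
        // At most a few elements: the plain linear fold wins.
        (_, Some(upper)) if upper <= 4 => iter.reduce(or),
        // Otherwise combine in a balanced tree shape. (Simple stand-in:
        // collect and split recursively; the incremental version further
        // down this thread avoids the intermediate Vec.)
        _ => {
            let items: Vec<T> = iter.collect();
            if items.is_empty() {
                None
            } else {
                Some(balanced(items, or))
            }
        }
    }
}

/// Recursively combine the two halves of `items` (assumed non-empty).
fn balanced<T>(mut items: Vec<T>, or: &impl Fn(T, T) -> T) -> T {
    if items.len() == 1 {
        return items.pop().unwrap();
    }
    let right = items.split_off(items.len() / 2);
    or(balanced(items, or), balanced(right, or))
}
```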
Force-pushed from afc3e00 to ffd9920
Looks like the last commit regressed performance again?
> ///
> /// ```text
> /// linear: (((((a ∨ b) ∨ c) ∨ d) ∨ e) ∨ f) ∨ g
> /// tree: ((a ∨ b) ∨ (c ∨ d)) ∨ ((e ∨ f) ∨ g)
It's unclear to me why this is algorithmically cheaper. If the constraint sets are all disjoint, this is equivalent (modulo node ordering, which we can't really control here). The worst case is also equivalent. In any case, both BDD constructions operate on the same set of pairs of nodes, so we end up ORing two medium-sized constraint sets instead of a large constraint set with a small one, which should end up being equivalent? I may be missing something here, given that the benchmarks agree this is an improvement.
My analysis is admittedly a bit hand-wavy:
If the constraints are all disjoint, then the size of each constraint set (i.e. the number of internal nodes) grows linearly: |a| = 1, |ab| = 2, |abc| = 3, etc.
Each time we invoke a BDD operator, it costs something like O(m + n) time.
So with the linear shape (before this PR), you'd end up with a total cost of
    (((((a ∨ b) ∨ c) ∨ d) ∨ e) ∨ f) ∨ g

    a ∨ b         cost 2
    ab ∨ c        cost 3
    abc ∨ d       cost 4
    abcd ∨ e      cost 5
    abcde ∨ f     cost 6
    abcdef ∨ g    cost 7
    total             27
and with the tree shape, you get
    ((a ∨ b) ∨ (c ∨ d)) ∨ ((e ∨ f) ∨ g)

    a ∨ b         cost 2
    c ∨ d         cost 2
    ab ∨ cd       cost 4
    e ∨ f         cost 2
    ef ∨ g        cost 3
    abcd ∨ efg    cost 7
    total             20
I think that works out to an overall cost of O(n^2) for the linear shape, and O(n log n) for the tree shape.
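As a sanity check on those bounds, here is my own back-of-the-envelope restatement, assuming disjoint inputs of size 1 and an O(m + n) cost per operator application:

```math
\text{linear: } \sum_{k=1}^{n-1} (k + 1) = O(n^2)
\qquad
\text{tree: } O(\log n)\ \text{levels} \times O(n)\ \text{total work per level} = O(n \log n)
```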
Hmm. I suspect there is something more nuanced here, which is that if a, b, c, d, ... are in source order (i.e., a is the lowest-order variable in the BDD ordering), then applying (((((a ∨ b) ∨ c) ∨ d) ∨ e) ∨ f) ∨ g is O(1) at each step (after #22777 and assuming disjointness). Even assuming an arbitrary order, the probability that the final step has to traverse the entire LHS is the probability that g is the lowest-order variable, while in the distributed approach, you are required to at least traverse one side of the tree entirely, and even the best case has quite low probability (that all the nodes on the LHS are higher-order than the RHS, or vice versa).
That being said, I'm happy to defer to the benchmark results here.
> // number of elements seen so far. To do that, whenever the last two elements of the vector
> // have the same depth, we apply the operator once to combine those two elements, adding
> // the result back to the vector with an incremented depth. (That might let us combine the
> // result with the _next_ intermediate result in the vector, and so on.)
Doesn't this assume the input constraint sets are of equal or similar size (though I suppose that is mostly true currently)? Should we try sorting by size here?
Assuming my analysis above is correct, I think you're right that sorting would give us a better guarantee of always doing the optimally cheapest ordering of operations. But my worry is that the cost of tracking/calculating the node size, and then doing the sort, would counteract any gain that we'd get.
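For what it's worth, here is a self-contained sketch of that depth-tracking scheme (a hypothetical helper, not the actual ty implementation). Because the depths on the stack are strictly decreasing from bottom to top, the stack holds at most one entry per depth, i.e. O(log n) entries:

```rust
/// Tree-shaped fold over a stream of unknown length. Each new element enters
/// at depth 0; whenever the top two stack entries have equal depth we combine
/// them and bump the depth, which can cascade into further combines.
fn tree_reduce<T>(iter: impl Iterator<Item = T>, op: impl Fn(T, T) -> T) -> Option<T> {
    let mut stack: Vec<(T, u32)> = Vec::new();
    for item in iter {
        stack.push((item, 0));
        while stack.len() >= 2 && stack[stack.len() - 1].1 == stack[stack.len() - 2].1 {
            let (rhs, depth) = stack.pop().unwrap();
            let (lhs, _) = stack.pop().unwrap();
            stack.push((op(lhs, rhs), depth + 1));
        }
    }
    // The input length need not be a power of two, so combine any leftover
    // intermediate results right-to-left at the end.
    let mut acc = stack.pop()?.0;
    while let Some((lhs, _)) = stack.pop() {
        acc = op(lhs, acc);
    }
    Some(acc)
}
```

For the a through g example above, this produces exactly the ((a ∨ b) ∨ (c ∨ d)) ∨ ((e ∨ f) ∨ g) grouping from the quoted doc comment.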
When I try to reproduce the timing numbers, I don't get the same results as codspeed ("main" being the current baseline). From my testing, "always" seems like a clear winner, at least for these two repos. I'm going to repush the PR to that state to see if maybe there was some weird temporary inconsistency in the codspeed results?
Force-pushed from 18f5b2a to 7c30063
Force-pushed from 7c30063 to 003ea9b
* main: (62 commits)
  - [`refurb`] Do not add `abc.ABC` if already present (`FURB180`) (#22234)
  - [ty] Add a new `assert-type-unspellable-subtype` diagnostic (#22815)
  - [ty] Avoid duplicate syntax errors for `await` outside functions (#22826)
  - [ty] Fix unary operator false-positive for constrained TypeVars (#22783)
  - [ty] Fix binary operator false-positive for constrained TypeVars (#22782)
  - [ty] Fix false-positive `unsupported-operator` for "symmetric" TypeVars (#22756)
  - [`pydocstyle`] Clarify which quote styles are allowed (`D300`) (#22825)
  - [ty] Use distributed versions of AND and OR on constraint sets (#22614)
  - [ty] Add support for dict literals and dict() calls as default values for parameters with TypedDict types (#22161)
  - Document `-` stdin convention in CLI help text (#22817)
  - [ty] Make `infer_subscript_expression_types` a method on `Type` (#22731)
  - [ty] Simplify `OverloadLiteral::spans` and `OverloadLiteral::parameter_span` (#22823)
  - [ty] Require both `*args` and `**kwargs` when calling a `ParamSpec` callable (#22820)
  - [ty] Handle tagged errors in conformance (#22746)
  - Add `--color` cli option to force colored output (#22806)
  - Identify notebooks by LSP didOpen instead of `.ipynb` file extension (#22810)
  - [ty] Fix docstring rendering for literal blocks after doctests (#22676)
  - [ty] Update salsa to fix out-of-order query validation (#22498)
  - [ty] Inline cycle initial and recovery functions (#22814)
  - [ty] Pass the generic context through the decorator (#22544)
  - ...
There are some pathological examples where we create a constraint set which is the AND or OR of several smaller constraint sets. For example, when calling a function with many overloads, where the argument is a typevar, we create an OR of the typevar specializing to a type compatible with the respective parameter of each overload.

Most functions have a small number of overloads. But there are some examples of methods with 15-20 overloads (pydantic, numpy, our own auto-generated `__getitem__` for large tuple literals). For those cases, it is helpful to be more clever about how we construct the final result.

Before, we would just step through the `Iterator` of elements and accumulate them into a result constraint set. That results in an O(n) number of calls to the underlying `and` or `or` operator, each of which might have to construct a large temporary BDD tree.

AND and OR are both associative, so we can do better! We now invoke the operator in a "tree" shape (described in more detail in the doc comment). We still have to perform the same number of calls, but more of the calls operate on smaller BDDs, resulting in a much smaller amount of overall work.