rustdoc-search: use set ops for ranking and filtering #118402
r? @fmease (rustbot has picked a reviewer for you, use r? to override)
Some changes occurred in HTML/CSS/JS. cc @GuillaumeGomez, @jsha
Overall, I have no issue with this change. But for future optimizations that require increasing code size, I'd really like us to have a wide enough test suite to ensure that they don't negatively impact other cases. Considering this, I'm wondering if we shouldn't start by writing these performance checks (even if they are not run in rustc's CI) before landing any new performance improvement PR. What do you think?
That does seem like a good idea, yeah. Any suggestions about where to start?
I've added a couple more benchmarks:
I've also added more queries to the |
@bors r=GuillaumeGomez
Rollup merge of rust-lang#118886 - GuillaumeGomez:clean-up-search-vars, r=notriddle
Clean up variables in `search.js`
While reviewing rust-lang#118402, I saw a few small cleanups that were needed, mostly about variable creation.
r? `@notriddle`
☔ The latest upstream changes (presumably #118900) made this pull request unmergeable. Please resolve the merge conflicts.
The hash changes are based on some tests with `arti` and various specific queries, aimed at reducing the false positive rate.
Sorting the query elements so that generics always come first is instead aimed at reducing the number of Map operations on `mgens`, on the assumption that if the bloom filter does produce a false positive, unification will be able to reject the row without having to track a mapping.
- https://hur.st/bloomfilter/?n=3&p=&m=96&k=6 Different functions have different numbers of inputs, and unification isn't very slow anyway, so figuring out a single ideal number of hash functions is nasty, but 6 keeps the false positive rate low even up to 10 inputs.
- https://web.archive.org/web/20210927123933/https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.72.2442&rep=rep1&type=pdf This is where `h1` and `h2` come from; both are derived from `h0`.
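The two links above can be made concrete with a small sketch. It assumes that `h1` and `h2` follow the common double-hashing construction, where probe `i` lands on bit `(h1 + i * h2) mod m`, and it uses the standard bloom filter estimate `(1 - e^(-k*n/m))^k` for the false positive rate with `m = 96` bits and `k = 6` probes. The mixer `h0` and every function name here are hypothetical stand-ins, not the code in `search.js`.

```javascript
// Hypothetical sketch: the names and the mixing function are assumptions,
// not taken from rustdoc's search.js.
const M = 96; // number of bloom bits, matching m=96 in the calculator link
const K = 6; // number of probe positions per item, matching k=6

// A small 32-bit integer mixer standing in for `h0` (assumption).
function h0(x) {
    x = Math.imul(x ^ (x >>> 16), 0x45d9f3b);
    x = Math.imul(x ^ (x >>> 16), 0x45d9f3b);
    return (x ^ (x >>> 16)) >>> 0;
}

// Derive K bit positions from two hashes that both come from h0:
// position_i = (h1 + i * h2) mod M.
function bitPositions(id) {
    const a = h0(id);
    const h1 = a % M;
    const h2 = 1 + (h0(a) % (M - 1)); // never zero, so each probe moves
    const positions = [];
    for (let i = 0; i < K; i += 1) {
        positions.push((h1 + i * h2) % M);
    }
    return positions;
}

// Standard bloom filter estimate for n inserted items.
function falsePositiveRate(n, m = M, k = K) {
    return Math.pow(1 - Math.exp((-k * n) / m), k);
}
```

Calling `falsePositiveRate(3)` and `falsePositiveRate(10)` gives a feel for how the estimate grows between a typical signature and a large one.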
@bors r=GuillaumeGomez Rebased with the variable changes.
☀️ Test successful - checks-actions
Finished benchmarking commit (eeff92a): comparison URL.
Overall result: no relevant changes - no action needed
@rustbot label: -perf-regression
Instruction count
This benchmark run did not return any relevant results for this metric.
Max RSS (memory usage)
Results: This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Cycles
Results: This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Binary size
This benchmark run did not return any relevant results for this metric.
Bootstrap: 671.752s -> 671.988s (0.04%)
This commit adds ranking and quick filtering to type-based search, improving performance and having it order results based on their type signatures.
Preview
Profiler output: https://notriddle.com/rustdoc-html-demo-6/profile-8/index.html
Preview: https://notriddle.com/rustdoc-html-demo-6/ranking-and-filtering-v2/std/index.html
Motivation
If I write a query like `str -> String`, a lot of functions come up. That's to be expected, but `String::from` should come up on top, and it doesn't right now. This is because the sorting algorithm is based on the function's name, and doesn't consider the type signature at all. `slice::join` even comes up above it!
To fix this, the sorting should take into account the function's signature, and the closer match should come up on top.
Guide-level description
When searching by type signature, results with a "closer" match will show up above results that match less precisely.
Reference-level explanation
Function signature search works in three major phases. The first two are:
* A compact "fingerprint," based on the bloom filter technique, is used to check for matches and to estimate the distance. It sometimes has false positive matches, but it also operates on 128 bits of contiguous memory and requires no backtracking, so it performs a lot better than real unification.
  The fingerprint represents the set of items in the type signature, but it does not represent nesting, and it ignores when the same item appears more than once.
  The result is rejected if any query bits are absent in the function, or if the distance is higher than the current maximum and 200 results have already been found.
* The second step performs unification. This is where nesting and true bag semantics are taken into account, and it has no false positives. It uses a recursive, backtracking algorithm.
  The result is rejected if any query elements are absent in the function.
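The fingerprint phase can be sketched roughly as follows. This is a minimal illustration, not the code in `search.js`: it assumes the 128 bits are laid out as three 32-bit words of bloom bits plus one word holding the distinct-item count, and the names `mix`, `hashBits`, `buildFingerprint`, and `compareFingerprints` are all hypothetical.

```javascript
// Hypothetical sketch of a 128-bit fingerprint. The layout (96 bloom bits
// plus a 32-bit distinct-item count), the hash, and all names are
// assumptions made for illustration.
const BLOOM_BITS = 96;
const PROBES = 6;

// Stand-in 32-bit integer mixer (assumption).
function mix(x) {
    x = Math.imul(x ^ (x >>> 16), 0x45d9f3b);
    x = Math.imul(x ^ (x >>> 16), 0x45d9f3b);
    return (x ^ (x >>> 16)) >>> 0;
}

// Bit positions for one item id.
function hashBits(id) {
    const a = mix(id);
    const h1 = a % BLOOM_BITS;
    const h2 = 1 + (mix(a) % (BLOOM_BITS - 1));
    const bits = [];
    for (let i = 0; i < PROBES; i += 1) {
        bits.push((h1 + i * h2) % BLOOM_BITS);
    }
    return bits;
}

// Build a fingerprint from a flat list of item ids. Nesting is already
// lost at this point, and repeated ids are ignored (set semantics).
function buildFingerprint(itemIds) {
    const fp = new Uint32Array(4);
    const seen = new Set();
    for (const id of itemIds) {
        if (seen.has(id)) {
            continue;
        }
        seen.add(id);
        for (const bit of hashBits(id)) {
            fp[bit >>> 5] |= 1 << (bit & 31);
        }
    }
    fp[3] = seen.size; // number of distinct items
    return fp;
}

// Returns null if any query bit is absent from the function's fingerprint,
// otherwise the function's distinct-item count as the distance estimate.
function compareFingerprints(fnFp, queryFp) {
    for (let i = 0; i < 3; i += 1) {
        if ((queryFp[i] & ~fnFp[i]) !== 0) {
            return null;
        }
    }
    return fnFp[3];
}
```

A `null` result means the row can be skipped without unification. A non-null result may still be a false positive, which is why unification runs afterwards.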
Drawbacks
This makes the code bigger.
More than that, this design is a subtle trade-off. It makes the cases I've tested against measurably faster, but it's not clear how well this extends to other crates with potentially more functions and fewer types.
The more complex things get, the more important it is to gather a good set of data to test with (this is arguably more important than the actual benchmarking infrastructure right now).
Rationale and alternatives
Throwing a bloom filter in front makes it faster.
More than that, it tries to take a tactic where the system can not only check for potential matches, but also get an accurate distance function without needing to do unification. That way it can skip unification even on items that have the needed elements, as long as they have more items than the currently found maximum.
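A simplified sketch of that skip logic is shown here. The row shape (`hasQueryBits`, `distance`), the `unify` callback, and the function name are hypothetical, and the current maximum never shrinks in this version, which the real code may handle differently.

```javascript
// Hypothetical sketch: rows carry the outcome of the fingerprint check
// (`hasQueryBits`) and the distance estimate (`distance`). `unify` stands
// in for the expensive backtracking unification step.
function searchWithPrefilter(rows, unify, maxResults = 200) {
    const results = [];
    let maxDistance = 0;
    let unifyCalls = 0;
    for (const row of rows) {
        // Cheap rejection: some query bit is missing from the function.
        if (!row.hasQueryBits) {
            continue;
        }
        // Cheap rejection: enough results already exist and this row
        // cannot rank above the worst of them.
        if (results.length >= maxResults && row.distance > maxDistance) {
            continue;
        }
        unifyCalls += 1;
        if (!unify(row)) {
            continue;
        }
        results.push(row);
        maxDistance = Math.max(maxDistance, row.distance);
    }
    results.sort((a, b) => a.distance - b.distance);
    return { results: results.slice(0, maxResults), unifyCalls };
}
```

The `unifyCalls` counter is only there to make the saving visible when experimenting with different inputs.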
If I didn't want to be able to cheaply do set operations on the fingerprint, a cuckoo filter is supposed to have better performance. But the nice bit-banging set intersection doesn't work AFAIK.
I also looked into minhashing, but since it's actually an unbiased estimate of the similarity coefficient, I'm not sure how it could be used to skip unification (I wouldn't know if the estimate was too low or too high).
This function actually uses the number of distinct items as its "distance function." This should give the same results that it would have gotten from a Jaccard Distance $1-\frac{|F\cap{}Q|}{|F\cup{}Q|}$, while being cheaper to compute. This is because:
* The function $F$ must be a superset of the query $Q$, so their union is just $F$ and their intersection is $Q$, and the distance reduces to $1-\frac{|Q|}{|F|}$.
* There are no magic thresholds. These values are only being used to compare against each other while sorting (and, if 200 results are found, to compare with the maximum match). This means we only care whether one value is bigger than the other, not what its actual value is. Since $Q$ is the same for everything, $|Q|$ can be treated as a constant and safely left out, giving $1-\frac{1}{|F|} = \frac{|F|}{|F|}-\frac{1}{|F|} = \frac{|F|-1}{|F|}$, which grows whenever $|F|$ grows. And, since the values are only being compared with each other, $|F|$ alone gives the same ordering.
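The ordering argument can be checked numerically. The sketch below assumes every candidate is a superset of the query, so the Jaccard distance is `1 - |Q| / |F|`, and compares the order produced by that distance with the order produced by `|F|` alone. The function names and sample sizes are made up for illustration.

```javascript
// Jaccard distance when F is a superset of Q:
// |F ∩ Q| = |Q| and |F ∪ Q| = |F|.
function jaccardDistanceSuperset(fSize, qSize) {
    return 1 - qSize / fSize;
}

// Sort candidate sizes by full Jaccard distance.
function orderByJaccard(fSizes, qSize) {
    return [...fSizes].sort(
        (a, b) => jaccardDistanceSuperset(a, qSize) - jaccardDistanceSuperset(b, qSize)
    );
}

// Sort candidate sizes by the cheaper stand-in: the number of distinct items.
function orderBySize(fSizes) {
    return [...fSizes].sort((a, b) => a - b);
}

// Hypothetical distinct-item counts for four functions, query with 2 items.
const sampleSizes = [5, 2, 9, 3];
const byJaccard = orderByJaccard(sampleSizes, 2);
const bySize = orderBySize(sampleSizes);
```

Both `byJaccard` and `bySize` come out in the same order, which is the property the ranking relies on.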
Prior art
This is significantly different from how Hoogle does it.
It doesn't account for order, and it has no special account for nesting, though `Box<t>` is still two items, while `t` is only one.
This should give the same results that it would have gotten from a Jaccard Distance $1-\frac{|A\cap{}B|}{|A\cup{}B|}$, while being cheaper to compute.
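The item counting described above can be illustrated with a small sketch. The type shape `{ name, generics }` and the function name are hypothetical and chosen only to show that nesting and order are flattened away while each distinct name still counts once.

```javascript
// Hypothetical type shape: { name: string, generics?: Array<type> }.
// Collects the distinct item names in a list of types, ignoring nesting
// depth, order, and repeats.
function distinctItems(types, out = new Set()) {
    for (const ty of types) {
        out.add(ty.name);
        distinctItems(ty.generics || [], out);
    }
    return out;
}

// `Box<t>` contributes two items, `t` contributes one.
const boxOfT = [{ name: "Box", generics: [{ name: "t" }] }];
const justT = [{ name: "t" }];
const boxCount = distinctItems(boxOfT).size;
const tCount = distinctItems(justT).size;
```

Because the result is a set, `Vec<Vec<u8>>` would count as two items under this sketch, not three.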
Unresolved questions
`[]` and `()`, the slice/array and tuple/unit operators, are ignored while building the signature for the query. This is because they match more than one thing, making them ambiguous. Unfortunately, this also makes them a performance cliff. Is this likely to be a problem?
Right now, the system just stashes the type distance into the same field that Levenshtein distance normally goes in. This means exact query matches show up on top (for example, if you have a function like `fn nothing(a: Nothing, b: i32)`, then searching for `nothing` will show it on top even if there's another function with `fn bar(x: Nothing)` that's technically a closer match in type signature).
Future possibilities
It should be possible to adopt more sorting criteria to act as a tie breaker, which could be determined during unification.