Non-English search support #2393
Another solution is to use |
Update
I've fixed the major problems mentioned before. The teaser now works well with CJK text, which has no spaces as word separators. It also returns the correct result for "multiple words joined together as a single keyword", like this:
I added keyword highlighting in the breadcrumbs, in case the keywords don't occur in the document body at all:
(Opinionated) Line breaks are preserved in teasers (but indentation is removed, because it is hard to get right).
All of this is the behavior of the "fallback" strategy, which would currently only be enabled if the book's |
How does the "fallback" strategy work?
The main reason Chinese search didn't work before is a mixture of 2 facts:
So I put my effort into implementing "phrase search", and with that no "Chinese word splitting" algorithm is needed.
Tokenization
Text is basically divided into 4 categories:
Ideographs
When indexed, each character is treated as a separate word. When counted during teaser generation, 2 characters are counted as 1 word. Ideographs such as Chinese characters come in such a wide variety that even searching the index by single characters can narrow down the results enough for further processing by offline JS. Hangul syllables are technically not ideographs, but they share similar characteristics.
Emoji
Emoji Modifier Sequences and Zero Width Joiners are handled so that an emoji icon is treated as 1 word.
Non Word
These are not indexed, and not counted as words during teaser generation.
Default
These are separated by anything that is not
though there aren't spaces between them.
Teaser generation
This is a huge pain so I'll make it brief.
Filtering
For "phrase searching", one thing worth noting is that the result returned by elasticlunr might not be valid, for example with keyword
But it's obviously not what we want. So the new search strategy uses elasticlunr only as a first filtering pass, then applies extra Regex filtering to the returned results.
Highlighting
Only the above mentioned
Display range selection
One key difference from the existing implementation is that the range might not be contiguous. Another difference, which is opinionated, is that I chose to force the teaser to include each matched keyword at least once, so with keyword
would get a teaser like:
By the way, I decided that it's quite pointless to show half a clause (a sentence is composed of clauses) containing a highlighted keyword, and as a result if the book's |
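The two-pass idea described above (index each ideograph separately, then re-check the raw text for the whole phrase) can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: `is_ideograph`, `tokenize`, `first_pass`, `second_pass` and `search` are hypothetical names, the ideograph ranges are deliberately narrow, the first pass only imitates what an index lookup would return, and a plain substring check stands in for the Regex filtering.

```rust
use std::collections::HashSet;

/// Assumption: only CJK Unified Ideographs and Hangul syllables count as
/// "ideographs" here. A real implementation would likely cover more ranges.
fn is_ideograph(c: char) -> bool {
    matches!(c, '\u{4E00}'..='\u{9FFF}' | '\u{AC00}'..='\u{D7A3}')
}

/// Every ideograph becomes its own token; runs of other alphanumeric
/// characters form one lowercase token; anything else is a separator.
fn tokenize(text: &str) -> Vec<String> {
    let mut tokens = Vec::new();
    let mut word = String::new();
    for c in text.chars() {
        if is_ideograph(c) {
            if !word.is_empty() {
                tokens.push(std::mem::take(&mut word));
            }
            tokens.push(c.to_string());
        } else if c.is_alphanumeric() {
            word.extend(c.to_lowercase());
        } else if !word.is_empty() {
            tokens.push(std::mem::take(&mut word));
        }
    }
    if !word.is_empty() {
        tokens.push(word);
    }
    tokens
}

/// First pass (imitating the index lookup): the document must contain
/// every token of the keyword, in any order and at any position.
fn first_pass(doc: &str, keyword: &str) -> bool {
    let doc_tokens: HashSet<String> = tokenize(doc).into_iter().collect();
    tokenize(keyword).iter().all(|t| doc_tokens.contains(t))
}

/// Second pass: the keyword must occur contiguously in the document.
/// A substring check stands in for the Regex used in the description.
fn second_pass(doc: &str, keyword: &str) -> bool {
    doc.to_lowercase().contains(&keyword.to_lowercase())
}

/// Keep only the documents that survive both passes.
fn search<'a>(docs: &[&'a str], keyword: &str) -> Vec<&'a str> {
    docs.iter()
        .copied()
        .filter(|d| first_pass(d, keyword))
        .filter(|d| second_pass(d, keyword))
        .collect()
}
```

With the made-up documents `"搜索中文"` and `"文章中的内容"` and the keyword `"中文"`, both documents pass the first pass because each contains the characters 中 and 文 somewhere, but only the first one contains them adjacent to each other, so only it survives the second pass.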
Why doesn't Emoji work?
mdBook uses But there's a flaw in
let mut iter = token.chars();
if let Some(character) = iter.next() {
let mut item = self
During index building, While the JS library does it in this way:
elasticlunr.InvertedIndex.prototype.addToken = function (token, tokenInfo, root) {
var root = root || this.root,
idx = 0;
while (idx <= token.length - 1) {
var key = token[idx];
The JS string is actually iterated in UTF-16 code units, which are whole characters for English, most alphabetic text, and common Chinese characters, but not for emoji and rare Chinese characters. Currently mdBook cannot handle these, with or without my patch. |
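The mismatch can be reproduced with the Rust standard library alone. The sketch below is only an illustration of the two iteration schemes, not code from either library: `scalar_keys` and `utf16_keys` are hypothetical helpers, where the first walks a token the way `chars()` does and the second walks it in UTF-16 code units, which is what indexing a JS string with `token[idx]` yields.

```rust
/// Keys produced when a token is walked with `chars()`:
/// one key per Unicode scalar value.
fn scalar_keys(token: &str) -> Vec<char> {
    token.chars().collect()
}

/// Keys produced when a token is walked in UTF-16 code units,
/// as JS string indexing does.
fn utf16_keys(token: &str) -> Vec<u16> {
    token.encode_utf16().collect()
}

/// True when both schemes produce the same number of keys, i.e. when
/// no character of the token needs a surrogate pair.
fn key_counts_agree(token: &str) -> bool {
    scalar_keys(token).len() == utf16_keys(token).len()
}
```

For a common Chinese character such as U+4E2D both schemes yield one key, so a lookup follows the same path that indexing created. For U+1F600 (an emoji) or U+20000 (a rare Chinese character) the first scheme yields one key while the second yields a surrogate pair of two keys, so the path built on one side cannot be followed on the other.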
@ehuss what do you think about these design choices? |
Rollup merge of rust-lang#126057 - Sunshine40:rustdoc-search-non-english, r=notriddle Make html rendered by rustdoc allow searching non-English identifier / alias Fix alias search result showing `undefined` description. Inspired by rust-lang/mdBook#2393 . Not sure if it's worth it adding full-text search functionality to rustdoc rendered html.
#1081 has been stuck for a while, so I tried implementing my own version.
Preview the search functionality online
Inspired by #1496.
Major implementation steps:
elasticlunr-rs's indexing implementation (detail)
Unresolved questions:
Should the search-non-english feature be enabled by default? (Not likely, since it causes severe binary size bloat)