For strings longer than 42 characters, fuzzy search cannot return all relevant results. #2795
Comments
Related discussion on Telegram: https://t.me/manticoresearch_en/1659 |
We have a 42-codepoint constant hardcoded in the tokenizer, the dictionary, and the matching engine. I'm sure there is no easy way to lift this limit. |
We discussed this with @tomatolog. The reason for the 42 codepoint limit is to ensure the token length in bytes fits within a single byte. Many data structures and algorithms depend on this. This task can be estimated as extra large. |
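To make the "fits within a single byte" point concrete, here is a minimal arithmetic sketch. It assumes tokens are stored as UTF-8 (at most 4 bytes per codepoint) and that the length field is one unsigned byte; the engine's real internal layout is not described in this thread, and the names `worst_case_bytes` and `fits_in_one_byte` are hypothetical.

```python
# Illustrative arithmetic only; the engine's actual storage layout is an assumption.
MAX_CODEPOINTS = 42        # limit quoted in this thread
UTF8_MAX_BYTES_PER_CP = 4  # UTF-8 encodes any codepoint in at most 4 bytes


def worst_case_bytes(codepoints: int) -> int:
    """Upper bound on the UTF-8 byte length of a token with this many codepoints."""
    return codepoints * UTF8_MAX_BYTES_PER_CP


def fits_in_one_byte(length: int) -> bool:
    """True if the length can be stored in a single unsigned byte (0..255)."""
    return 0 <= length <= 0xFF


# A 42-codepoint token made of 4-byte characters (U+1F600) is the worst case.
token = "\U0001F600" * MAX_CODEPOINTS
encoded_len = len(token.encode("utf-8"))

print(encoded_len, fits_in_one_byte(encoded_len))  # 168 True
print(worst_case_bytes(64), fits_in_one_byte(worst_case_bytes(64)))  # 256 False
```

Under these assumptions a 42-codepoint token always fits, while a limit of 64 codepoints could already overflow a one-byte length in the worst case, which is consistent with the estimate that lifting the limit touches many data structures.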
Is it possible to allow non-CJK characters to be segmented using ngram (ngram_len=1)? |
@502925873 sure, here's an example:
|
@sanikolaev, I just verified that non-CJK characters can use ngram when configured this way, but then CJK characters cannot. I need both non-CJK and CJK characters to use ngram (ngram_len=1) at the same time. Is that possible? |
If non-CJK and CJK characters can both use ngram at the same time, my use case is covered and the 42-character limit no longer needs to be addressed. |
Here's an example:
|
Note however it uses |
@sanikolaev It works fine and the 42 character limit problem is solved perfectly. Thank you so much! |
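For readers wondering why single-character ngrams sidestep the limit, here is a rough Python simulation. It is not Manticore code and not the configuration posted above; `unigram_tokens` and `contains_run` are hypothetical helpers that only imitate the idea that with ngram_len=1 every token is one codepoint long, so no token can reach 42 codepoints.

```python
# Simulation of the idea only; not the engine's tokenizer.
def unigram_tokens(text: str) -> list[str]:
    """Split text into one token per non-whitespace character."""
    return [ch for ch in text if not ch.isspace()]


def contains_run(doc_tokens: list[str], query_tokens: list[str]) -> bool:
    """True if query_tokens occur as a contiguous run inside doc_tokens."""
    n = len(query_tokens)
    return any(
        doc_tokens[i:i + n] == query_tokens
        for i in range(len(doc_tokens) - n + 1)
    )


key = ("fghjkfuyaksdfgshdlugsdhhasdhohcxjmhcvawidshkgaskdsdhcvhxbzasdhczjhbd"
       "askabwdjkasbjcvxmvasf")
doc = unigram_tokens(key)

# Every token is a single codepoint, far below the 42-codepoint limit.
print(max(len(t) for t in doc))  # 1

# A query taken from the tail of the key (past the 42nd character) still matches.
print(contains_run(doc, unigram_tokens(key[-10:])))  # True
```

The trade-off in this model is that a long string becomes many tiny tokens, so matching works on character sequences rather than on whole words.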
Proposal:
In my usage scenario, there are strings (not words) composed of letters and numbers, such as URLs and keys, whose length exceeds 42 characters. With fuzzy matching, not all relevant records can be found. For example:
URL: https://www.google.com/search?q=%E7%BF%BB%E8%AF%91&sca_esv=48f637e9bc078c4b&sxsrf=ADLYWIIEWv04TETXF6O3I9HG6cCphhZaCQ%3A1730801209572&ei=Oe4pZ7zTIue4kPIP9LKk2Aw&oq=&gs_lp=Egxnd3Mtd2l6LXNlcnAiACoCCAAyBxAjGCcY6gIyBxAjGCcY6gIyBxAjGCcY6gIyBxAjGCcY6gIyBxAjGCcY6gIyBxAjGCcY6gIyBxAjGCcY6gIyBxAjGCcY6gIyBxAjGCcY6gIyBxAjGCcY6gJIngtQAFgAcAJ4AJABAJgBAKABAKoBALgBAcgBAPgBAZgCAqACC6gCCpgDCZIHATKgBwA&sclient=gws-wiz-serp
Key: fghjkfuyaksdfgshdlugsdhhasdhohcxjmhcvawidshkgaskdsdhcvhxbzasdhczjhbdaskabwdjkasbjcvxmvasf
A fuzzy search that uses characters located after the 42nd character cannot match the corresponding record.
I hope this usage scenario can be supported. Thank you.
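The symptom can be reproduced in a small Python model. This is a sketch under a stated assumption: that a token longer than 42 codepoints is cut to its first 42 codepoints at indexing time. The thread does not spell out the exact mechanism, and `index_token` and `substring_match` are hypothetical names, not engine functions.

```python
# Toy model of the reported behaviour; the truncation rule is an assumption.
MAX_TOKEN_CODEPOINTS = 42  # limit quoted in this thread


def index_token(token: str) -> str:
    """Keep only the first MAX_TOKEN_CODEPOINTS codepoints (assumed behaviour)."""
    return token[:MAX_TOKEN_CODEPOINTS]


def substring_match(indexed: str, query: str) -> bool:
    """Stand-in for a fuzzy or infix lookup against the stored token."""
    return query in indexed


key = ("fghjkfuyaksdfgshdlugsdhhasdhohcxjmhcvawidshkgaskdsdhcvhxbzasdhczjhbd"
       "askabwdjkasbjcvxmvasf")
stored = index_token(key)

head_query = key[5:15]   # characters inside the first 42
tail_query = key[-10:]   # characters after the 42nd

print(substring_match(stored, head_query))  # True
print(substring_match(stored, tail_query))  # False
```

In this model the tail of the key never reaches the index, so no query built from it can find the record, which matches the behaviour described in the proposal.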