For strings longer than 42 characters, fuzzy search cannot return all relevant results. #2795

Open
1 of 6 tasks
502925873 opened this issue Nov 27, 2024 · 10 comments

@502925873

Proposal:

In my usage scenario there are strings (not natural-language words) composed of letters and digits, such as URLs and keys, whose length exceeds 42 characters. With fuzzy matching, not all relevant records can be found. For example:

URL: https://www.google.com/search?q=%E7%BF%BB%E8%AF%91&sca_esv=48f637e9bc078c4b&sxsrf=ADLYWIIEWv04TETXF6O3I9HG6cCphhZaCQ%3A1730801209572&ei=Oe4pZ7zTIue4kPIP9LKk2Aw&oq=&gs_lp=Egxnd3Mtd2l6LXNlcnAiACoCCAAyBxAjGCcY6gIyBxAjGCcY6gIyBxAjGCcY6gIyBxAjGCcY6gIyBxAjGCcY6gIyBxAjGCcY6gIyBxAjGCcY6gIyBxAjGCcY6gIyBxAjGCcY6gIyBxAjGCcY6gJIngtQAFgAcAJ4AJABAJgBAKABAKoBALgBAcgBAPgBAZgCAqACC6gCCpgDCZIHATKgBwA&sclient=gws-wiz-serp

Key: fghjkfuyaksdfgshdlugsdhhasdhohcxjmhcvawidshkgaskdsdhcvhxbzasdhczjhbdaskabwdjkasbjcvxmvasf

A fuzzy search using characters that appear after the 42nd character cannot match the corresponding record.
I hope this usage scenario can be supported. Thank you.
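
A minimal reproduction sketch (the table name repro is just a placeholder, and the expectation below is inferred from the behavior described above, not verified output):

mysql> drop table if exists repro; create table repro(id bigint, f text);
mysql> call keywords('fghjkfuyaksdfgshdlugsdhhasdhohcxjmhcvawidshkgaskdsdhcvhxbzasdhczjhbdaskabwdjkasbjcvxmvasf', 'repro');

If the 42-codepoint cap applies, the tokenized/normalized value should come back truncated to the first 42 characters of the key, which is why a fuzzy query built from characters past that point has nothing to match.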

Checklist:

To be completed by the assignee. Check off tasks that have been completed or are not applicable.

  • Implementation completed
  • Tests developed
  • Documentation updated
  • Documentation reviewed
  • Changelog updated
  • OpenAPI YAML updated and issue created to rebuild clients
@sanikolaev
Collaborator

Rel. discussion in TG https://t.me/manticoresearch_en/1659

@tomatolog
Contributor

We have the 42-codepoint constant hardcoded in the tokenizer, the dictionary and the matching engine; I'm sure there is no easy way to lift it.

@sanikolaev
Collaborator

We discussed this with @tomatolog. The reason for the 42 codepoint limit is to ensure the token length in bytes fits within a single byte. Many data structures and algorithms depend on this.
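
Presumably the arithmetic is: a codepoint takes up to 3 bytes in the UTF-8 sequences the engine accounts for, so 42 codepoints × 3 bytes = 126 bytes, which still fits in a single signed byte (maximum 127). This derivation is an assumption based on the explanation above, not something stated in the thread.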

This task can be estimated as extra large.

@502925873
Author

502925873 commented Nov 28, 2024

Is it possible to allow non_cjk characters to be segmented with ngrams (ngram_len=1)?
@tomatolog

@sanikolaev
Collaborator

@502925873 sure, here's an example:

mysql> drop table if exists t ; create table t(id bigint, f text) ngram_len='1' ngram_chars='non_cjk' charset_table='cjk'; call keywords('abc', 't');
--------------
drop table if exists t
--------------

Query OK, 0 rows affected (0.00 sec)

--------------
create table t(id bigint, f text) ngram_len='1' ngram_chars='non_cjk' charset_table='cjk'
--------------

Query OK, 0 rows affected (0.01 sec)

--------------
call keywords('abc', 't')
--------------

+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1    | a         | a          |
| 2    | b         | b          |
| 3    | c         | c          |
+------+-----------+------------+
3 rows in set (0.00 sec)

@502925873
Author

@sanikolaev, I just verified that with this setting non-CJK characters are segmented with ngrams, but CJK characters are not. I need both non_cjk and cjk characters to use ngrams (ngram_len=1) at the same time. Is that possible?

@502925873
Author

If non_cjk and cjk characters can both use ngrams at the same time, that covers this use case, and the 42-character limit no longer needs to be addressed.

@sanikolaev
Collaborator

Here's an example:

mysql> drop table if exists t ; create table t(id bigint, f text) ngram_len='1' ngram_chars='non_cjk,cjk' charset_table='-'; call keywords('abc幸福', 't');
--------------
drop table if exists t
--------------

Query OK, 0 rows affected (0.00 sec)

--------------
create table t(id bigint, f text) ngram_len='1' ngram_chars='non_cjk,cjk' charset_table='-'
--------------

Query OK, 0 rows affected (0.01 sec)

--------------
call keywords('abc幸福', 't')
--------------

+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1    | a         | a          |
| 2    | b         | b          |
| 3    | c         | c          |
| 4    | 幸        | 幸         |
| 5    | 福        | 福         |
+------+-----------+------------+
5 rows in set (0.00 sec)
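
With ngram_len='1' every character becomes its own token, so the 42-codepoint per-token cap stops mattering: a substring anywhere in a long string can be found as a phrase of single-character tokens. A sketch under that assumption (the inserted row and query are illustrative, not from this thread):

mysql> insert into t values (1, 'fghjkfuyaksdfgshdlugsdhhasdhohcxjmhcvawidshkgaskdsdhcvhxbzasdhczjhbdaskabwdjkasbjcvxmvasf');
mysql> select * from t where match('"xmvasf"');

Here "xmvasf" sits past the 42nd character of the key, but as a phrase of single-character tokens it should still match.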

@sanikolaev
Collaborator

Note, however, that this example uses charset_table='-'. You may want to pick a different, safer character so it doesn't cause problems (in this configuration - becomes a non-delimiter).
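
For example (my suggestion, not tested in this thread), a Private Use Area codepoint that should never occur in real data keeps charset_table non-empty without turning a printable character into a non-delimiter:

mysql> drop table if exists t; create table t(id bigint, f text) ngram_len='1' ngram_chars='non_cjk,cjk' charset_table='U+E000';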

@502925873
Author

@sanikolaev It works fine, and the 42-character limit problem is solved perfectly. Thank you so much!
