For strings longer than 42 characters, fuzzy search cannot return all relevant results. #2795

Open
1 of 6 tasks
502925873 opened this issue Nov 27, 2024 · 10 comments

@502925873

Proposal:

In my usage scenario there are strings (not natural-language words) composed of letters and digits, such as URLs and keys, whose length exceeds 42 characters. With fuzzy matching, not all relevant records can be found. For example:

URL: https://www.google.com/search?q=%E7%BF%BB%E8%AF%91&sca_esv=48f637e9bc078c4b&sxsrf=ADLYWIIEWv04TETXF6O3I9HG6cCphhZaCQ%3A1730801209572&ei=Oe4pZ7zTIue4kPIP9LKk2Aw&oq=&gs_lp=Egxnd3Mtd2l6LXNlcnAiACoCCAAyBxAjGCcY6gIyBxAjGCcY6gIyBxAjGCcY6gIyBxAjGCcY6gIyBxAjGCcY6gIyBxAjGCcY6gIyBxAjGCcY6gIyBxAjGCcY6gIyBxAjGCcY6gIyBxAjGCcY6gJIngtQAFgAcAJ4AJABAJgBAKABAKoBALgBAcgBAPgBAZgCAqACC6gCCpgDCZIHATKgBwA&sclient=gws-wiz-serp

Key: fghjkfuyaksdfgshdlugsdhhasdhohcxjmhcvawidshkgaskdsdhcvhxbzasdhczjhbdaskabwdjkasbjcvxmvasf

A fuzzy search using characters that appear after the 42nd character cannot match the corresponding record.
I hope this usage scenario can be supported. Thank you.
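
A minimal reproduction sketch (the table name repro is just a placeholder, and the expectation below is inferred from the behavior described above, not verified output):

mysql> drop table if exists repro; create table repro(id bigint, f text);
mysql> call keywords('fghjkfuyaksdfgshdlugsdhhasdhohcxjmhcvawidshkgaskdsdhcvhxbzasdhczjhbdaskabwdjkasbjcvxmvasf', 'repro');

If the 42-codepoint cap applies, the tokenized/normalized value should come back truncated to the first 42 characters of the key, which is why a fuzzy query built from characters past that point has nothing to match.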

Checklist:

To be completed by the assignee. Check off tasks that have been completed or are not applicable.

  • Implementation completed
  • Tests developed
  • Documentation updated
  • Documentation reviewed
  • Changelog updated
  • OpenAPI YAML updated and issue created to rebuild clients
@sanikolaev
Collaborator

Rel. discussion in TG https://t.me/manticoresearch_en/1659

@tomatolog
Contributor

We have the 42-codepoint constant hardcoded in the tokenizer, the dictionary and the matching engine; I'm sure there is no easy way to lift it.

@sanikolaev
Collaborator

We discussed this with @tomatolog. The reason for the 42 codepoint limit is to ensure the token length in bytes fits within a single byte. Many data structures and algorithms depend on this.
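
Presumably the arithmetic is: a codepoint takes up to 3 bytes in the UTF-8 sequences the engine accounts for, so 42 codepoints × 3 bytes = 126 bytes, which still fits in a single signed byte (maximum 127). This derivation is an assumption based on the explanation above, not something stated in the thread.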

This task can be estimated as extra large.

@502925873
Author

502925873 commented Nov 28, 2024

Is it possible to allow non_cjk characters to be segmented with ngrams (ngram_len=1)?
@tomatolog

@sanikolaev
Collaborator

@502925873 sure, here's an example:

mysql> drop table if exists t ; create table t(id bigint, f text) ngram_len='1' ngram_chars='non_cjk' charset_table='cjk'; call keywords('abc', 't');
--------------
drop table if exists t
--------------

Query OK, 0 rows affected (0.00 sec)

--------------
create table t(id bigint, f text) ngram_len='1' ngram_chars='non_cjk' charset_table='cjk'
--------------

Query OK, 0 rows affected (0.01 sec)

--------------
call keywords('abc', 't')
--------------

+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1    | a         | a          |
| 2    | b         | b          |
| 3    | c         | c          |
+------+-----------+------------+
3 rows in set (0.00 sec)

@502925873
Author

@sanikolaev, I just verified that with this setting non-CJK characters are segmented with ngrams, but CJK characters are not. I need both non_cjk and cjk characters to use ngrams (ngram_len=1) at the same time. Is that possible?

@502925873
Author

If non_cjk and cjk characters can both use ngrams at the same time, that covers this use case, and the 42-character limit no longer needs to be addressed.

@sanikolaev
Collaborator

Here's an example:

mysql> drop table if exists t ; create table t(id bigint, f text) ngram_len='1' ngram_chars='non_cjk,cjk' charset_table='-'; call keywords('abc幸福', 't');
--------------
drop table if exists t
--------------

Query OK, 0 rows affected (0.00 sec)

--------------
create table t(id bigint, f text) ngram_len='1' ngram_chars='non_cjk,cjk' charset_table='-'
--------------

Query OK, 0 rows affected (0.01 sec)

--------------
call keywords('abc幸福', 't')
--------------

+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1    | a         | a          |
| 2    | b         | b          |
| 3    | c         | c          |
| 4    | 幸        | 幸         |
| 5    | 福        | 福         |
+------+-----------+------------+
5 rows in set (0.00 sec)
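
With ngram_len='1' every character becomes its own token, so the 42-codepoint per-token cap stops mattering: a substring anywhere in a long string can be found as a phrase of single-character tokens. A sketch under that assumption (the inserted row and query are illustrative, not from this thread):

mysql> insert into t values (1, 'fghjkfuyaksdfgshdlugsdhhasdhohcxjmhcvawidshkgaskdsdhcvhxbzasdhczjhbdaskabwdjkasbjcvxmvasf');
mysql> select * from t where match('"xmvasf"');

Here "xmvasf" sits past the 42nd character of the key, but as a phrase of single-character tokens it should still match.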

@sanikolaev
Collaborator

Note, however, that this example uses charset_table='-'. You may want to pick a different, safer character so it doesn't cause problems (in this configuration - becomes a non-delimiter).
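
For example (my suggestion, not tested in this thread), a Private Use Area codepoint that should never occur in real data keeps charset_table non-empty without turning a printable character into a non-delimiter:

mysql> drop table if exists t; create table t(id bigint, f text) ngram_len='1' ngram_chars='non_cjk,cjk' charset_table='U+E000';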

@502925873
Author

@sanikolaev It works fine, and the 42-character limit problem is solved perfectly. Thank you so much!
