Replies: 2 comments 2 replies
-
Hi @guxingke, Manticore splits by whitespace by default:

mysql> drop table if exists t; create table t(f text) charset_table='cjk'; insert into t(f) values('中 文'); select highlight() from t where match('中');
--------------
drop table if exists t
--------------
Query OK, 0 rows affected (0.01 sec)
--------------
create table t(f text) charset_table='cjk'
--------------
Query OK, 0 rows affected (0.00 sec)
--------------
insert into t(f) values('中 文')
--------------
Query OK, 1 row affected (0.00 sec)
--------------
select highlight() from t where match('中')
--------------
+----------------+
| highlight() |
+----------------+
| <b>中</b> 文 |
+----------------+
1 row in set (0.01 sec)

You can fine-tune the behaviour using charset_table, but it may be easier to use the ICU Chinese morphology:

mysql> drop table if exists t; create table t(f text) charset_table='cjk,non_cjk' morphology='icu_chinese'; insert into t(f) values('我喜欢学习中文'); select highlight() from t where match('中文');
--------------
drop table if exists t
--------------
Query OK, 0 rows affected (0.00 sec)
--------------
create table t(f text) charset_table='cjk,non_cjk' morphology='icu_chinese'
--------------
Query OK, 0 rows affected (0.00 sec)
--------------
insert into t(f) values('我喜欢学习中文')
--------------
Query OK, 1 row affected (0.01 sec)
--------------
select highlight() from t where match('中文')
--------------
+---------------------------------+
| highlight() |
+---------------------------------+
| 我 喜欢 学习 <b>中文</b> |
+---------------------------------+
1 row in set (0.00 sec)

Here's an interactive course about it: https://play.manticoresearch.com/icu-chinese/

Alternatively, you can use ngram_chars.
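For reference, here's a minimal sketch of the ngram_chars alternative (assuming a running Manticore instance; the table name t is reused from the examples above). With ngram_chars='cjk' and ngram_len='1', every CJK character is indexed as its own unigram token, so any contiguous CJK substring becomes matchable without a segmentation dictionary:

```sql
-- Sketch only: ngram-based CJK handling instead of ICU morphology.
-- ngram_chars declares which characters are tokenized as unigrams;
-- ngram_len='1' is the supported n-gram length.
drop table if exists t;
create table t(f text) charset_table='non_cjk' ngram_len='1' ngram_chars='cjk';
insert into t(f) values('我喜欢学习中文');
-- the query '中文' is likewise split into the unigrams '中' '文'
-- and matched as an implicit phrase
select highlight() from t where match('中文');
```

The trade-off versus icu_chinese: ngrams never miss a substring, but the index is larger and matches are character-based rather than word-based.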
-
mysql> desc testrt;
+---------+--------+----------------+
| Field | Type | Properties |
+---------+--------+----------------+
| id | bigint | |
| title | text | indexed stored |
| content | text | indexed stored |
| gid | uint | |
+---------+--------+----------------+
mysql> call keywords('清华 大学 清华大学', 'testrt');
+------+--------------+--------------+
| qpos | tokenized | normalized |
+------+--------------+--------------+
| 1 | 清 | 清 |
| 2 | 华 | 华 |
| 3 | 大学 | 大学 |
| 4 | 清华大学 | 清华大学 |
+------+--------------+--------------+

Thank you. As shown above, the expected result for my case is ['清华', '大学', '清华大学']. I guess I need to modify the default dictionary data (#371 (comment))?
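As a quick way to compare segmenters, CALL KEYWORDS can be run against a table that has the ICU Chinese morphology enabled (a sketch; t2 is a hypothetical table name, and the exact token split depends on the ICU dictionary shipped with your build):

```sql
-- Sketch: inspect how the ICU Chinese segmenter tokenizes the same input
drop table if exists t2;
create table t2(f text) charset_table='cjk,non_cjk' morphology='icu_chinese';
call keywords('清华大学', 't2');
```

Comparing this output with the testrt output above shows whether the difference comes from the table's tokenization settings or from the dictionary itself.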
-
For the Chinese word tokenization scenario, I think CJK is almost unusable. Is there a mechanism like the Elasticsearch Whitespace Tokenizer
(https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-whitespace-tokenizer.html#analysis-whitespace-tokenizer)
that allows users to segment words in advance and then index them with ManticoreSearch? If so, how can it be done? Thanks.
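One workaround, assuming the text can be run through an external segmenter before it reaches Manticore: join the segmented words with spaces, and Manticore's default whitespace splitting does the rest. A sketch (t3 is a hypothetical table name; '清华大学 位于 北京' stands in for the output of your own tokenizer):

```sql
-- Sketch: index externally pre-segmented Chinese text.
-- With charset_table='cjk,non_cjk' and no morphology/ngram settings,
-- each space-separated chunk of CJK text is indexed as one token.
drop table if exists t3;
create table t3(f text) charset_table='cjk,non_cjk';
insert into t3(f) values('清华大学 位于 北京');  -- segmented upstream
select highlight() from t3 where match('清华大学');
```

The caveat is that queries must be segmented the same way as the indexed text, since matching happens on whole tokens.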