Replies: 2 comments 2 replies
-
Hi @guxingke, Manticore splits by whitespace by default:

mysql> drop table if exists t; create table t(f text) charset_table='cjk'; insert into t(f) values('中 文'); select highlight() from t where match('中');
--------------
drop table if exists t
--------------
Query OK, 0 rows affected (0.01 sec)
--------------
create table t(f text) charset_table='cjk'
--------------
Query OK, 0 rows affected (0.00 sec)
--------------
insert into t(f) values('中 文')
--------------
Query OK, 1 row affected (0.00 sec)
--------------
select highlight() from t where match('中')
--------------
+----------------+
| highlight() |
+----------------+
| <b>中</b> 文 |
+----------------+
1 row in set (0.01 sec)

You can fine-tune the behaviour using charset_table, but it may be easier to use the ICU Chinese morphology:

mysql> drop table if exists t; create table t(f text) charset_table='cjk,non_cjk' morphology='icu_chinese'; insert into t(f) values('我喜欢学习中文'); select highlight() from t where match('中文');
--------------
drop table if exists t
--------------
Query OK, 0 rows affected (0.00 sec)
--------------
create table t(f text) charset_table='cjk,non_cjk' morphology='icu_chinese'
--------------
Query OK, 0 rows affected (0.00 sec)
--------------
insert into t(f) values('我喜欢学习中文')
--------------
Query OK, 1 row affected (0.01 sec)
--------------
select highlight() from t where match('中文')
--------------
+---------------------------------+
| highlight() |
+---------------------------------+
| 我 喜欢 学习 <b>中文</b> |
+---------------------------------+
1 row in set (0.00 sec)

Here's an interactive course about it: https://play.manticoresearch.com/icu-chinese/

Alternatively, you can use ngram_chars.
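For reference, here's a minimal sketch of the ngram_chars alternative (assuming a running Manticore instance; the table name t is reused from the examples above). With ngram_chars='cjk' and ngram_len='1', every CJK character is indexed as its own unigram token, so any contiguous CJK substring becomes matchable without a segmentation dictionary:

```sql
-- Sketch only: ngram-based CJK handling instead of ICU morphology.
-- ngram_chars declares which characters are tokenized as unigrams;
-- ngram_len='1' is the supported n-gram length.
drop table if exists t;
create table t(f text) charset_table='non_cjk' ngram_len='1' ngram_chars='cjk';
insert into t(f) values('我喜欢学习中文');
-- the query '中文' is likewise split into the unigrams '中' '文'
-- and matched as an implicit phrase
select highlight() from t where match('中文');
```

The trade-off versus icu_chinese: ngrams never miss a substring, but the index is larger and matches are character-based rather than word-based.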
-
mysql> desc testrt;
+---------+--------+----------------+
| Field | Type | Properties |
+---------+--------+----------------+
| id | bigint | |
| title | text | indexed stored |
| content | text | indexed stored |
| gid | uint | |
+---------+--------+----------------+
mysql> call keywords('清华 大学 清华大学', 'testrt');
+------+--------------+--------------+
| qpos | tokenized | normalized |
+------+--------------+--------------+
| 1 | 清 | 清 |
| 2 | 华 | 华 |
| 3 | 大学 | 大学 |
| 4 | 清华大学 | 清华大学 |
+------+--------------+--------------+

Thank you. As shown above, the expected result for my case is ['清华', '大学', '清华大学']. I guess I need to modify the default dictionary data (#371 (comment))?
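As a quick way to compare segmenters, CALL KEYWORDS can be run against a table that has the ICU Chinese morphology enabled (a sketch; t2 is a hypothetical table name, and the exact token split depends on the ICU dictionary shipped with your build):

```sql
-- Sketch: inspect how the ICU Chinese segmenter tokenizes the same input
drop table if exists t2;
create table t2(f text) charset_table='cjk,non_cjk' morphology='icu_chinese';
call keywords('清华大学', 't2');
```

Comparing this output with the testrt output above shows whether the difference comes from the table's tokenization settings or from the dictionary itself.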
-
For the Chinese word tokenization scenario, I think CJK is almost unusable. Is there a mechanism like the Elasticsearch Whitespace Tokenizer
(https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-whitespace-tokenizer.html#analysis-whitespace-tokenizer)
that allows users to segment words in advance and then index them with ManticoreSearch? If so, how can it be done? Thanks.
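One workaround, assuming the text can be run through an external segmenter before it reaches Manticore: join the segmented words with spaces, and Manticore's default whitespace splitting does the rest. A sketch (t3 is a hypothetical table name; '清华大学 位于 北京' stands in for the output of your own tokenizer):

```sql
-- Sketch: index externally pre-segmented Chinese text.
-- With charset_table='cjk,non_cjk' and no morphology/ngram settings,
-- each space-separated chunk of CJK text is indexed as one token.
drop table if exists t3;
create table t3(f text) charset_table='cjk,non_cjk';
insert into t3(f) values('清华大学 位于 北京');  -- segmented upstream
select highlight() from t3 where match('清华大学');
```

The caveat is that queries must be segmented the same way as the indexed text, since matching happens on whole tokens.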