Japanese Language support #532
Replies: 18 comments 52 replies
-
@ManyTheFish Thank you for writing the details.
It's fine. This is enough for Japanese to understand 👍

Language detection only with Kanji/Hanzi strings
I researched various things to see if whatlang could handle it, but it might be better not to expect too much. I'm just thinking it might be better to think of another way.

Normalization
I think Unicode NFKC is enough for Japanese normalization. (e.g. https://github.com/unicode-rs/unicode-normalization; see the sketch at the end of this comment.) It converts...

Indexing
I am very happy to be able to do ambiguous searches in hiragana, so I agree. Since Lindera stores the pronunciation in Katakana, I feel that it can be achieved by adapting the indexing.

About the Japanese input method
This is a supplement in the hope that it will be of some help.
There are two input methods for Japanese: "Romaji" and "Kana".
These can be switched via an option in the Japanese IME. Most Japanese users choose romaji input because the keyboard layout is easier to learn.
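As a concrete illustration of the NFKC point above, here is a minimal sketch using the unicode-normalization crate mentioned earlier (the input string is made up):

```rust
// NFKC folds full-width Latin letters and digits to ASCII, and recomposes
// half-width katakana with their voiced sound marks.
use unicode_normalization::UnicodeNormalization;

fn main() {
    let input = "ＭＥＩＬＩ１２３ ｶﾞｷﾞｸﾞ";
    let normalized: String = input.nfkc().collect();
    assert_eq!(normalized, "MEILI123 ガギグ");
    println!("{normalized}");
}
```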
-
Hello all!
All these issues are open to external contributions during the whole month, so don't hesitate to contribute! 🧑💻 This is another step in enhancing Japanese Language support; depending on future feedback, we will be able to go further. Thanks for all your feedback! ✍️ 🇯🇵
-
Hi all, I also put here the comment I wrote on meilisearch/charabia#139. This character normalization seems to be performed after tokenization, but in some cases it is better to perform character normalization before tokenization in Japanese. For example, this is a case where there is no problem even after tokenization:

Half-width

But the following cases can be problematic.

Since full-width numbers are already registered in the morphological dictionary (IPADIC), each number becomes a single token, so a full-width

If possible, I would like you to consider a way to perform character normalization before tokenization.
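A minimal sketch of the proposed ordering, with a `segment` callback standing in as a hypothetical placeholder for the real tokenizer (the actual Charabia pipeline differs):

```rust
// Run NFKC *before* segmentation so the dictionary lookup sees the
// normalized form; "１２３" becomes "123" before any token is produced,
// so a number-grouping rule can emit one token instead of three.
use unicode_normalization::UnicodeNormalization;

fn normalize_then_segment<F>(text: &str, segment: F) -> Vec<String>
where
    F: Fn(&str) -> Vec<String>,
{
    let normalized: String = text.nfkc().collect();
    segment(&normalized)
}
```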
-
@ManyTheFish I hope it will be of some help to you. https://speakerdeck.com/mosuka/the-importance-of-morphological-analysis-in-japanese-search-engines
-
I have published a simple application that I made to confirm that Meilisearch works in Japanese.
-
Hello people!

The current behavior

Language Detection
Today, we are using whatlang-rs to detect the Script and the Language in a text. Language detection is really important for Japanese Language support, mainly to make the difference with the Chinese Language when only Kanjis are used in a text, for example a small title or a query (see the whatlang sketch at the end of this comment).

Segmentation
To segment Japanese text, we are currently using Lindera, based on a Viterbi algorithm using the IPA dictionaries. Thanks to @mosuka for maintaining it. (a small explanation of Japanese segmentation)

Normalization
So far, we only normalize Japanese characters by replacing them with their decomposed compatible form; to give an example, half-width kanas are converted into kanas. To know more about this, I put some documentation about it below:

The remaining issues we should tackle in the future
Prototypes
There is a prototype of Meilisearch that completely deactivates the Chinese support; this way we avoid Language detection mistakes. In addition, this prototype activates the katakana-to-hiragana conversion. If you want to try this prototype, I put the link to it:

Thanks!
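To make the kanji-only ambiguity concrete, here is a minimal sketch calling the whatlang crate directly; which `Lang` comes back can vary with the whatlang version and the input, so treat the printed results as illustrative:

```rust
use whatlang::detect;

fn main() {
    // A kanji-only string contains no kana, so script detection sees only
    // Han characters and the language guess tends toward Chinese.
    if let Some(info) = detect("東京大学") {
        println!("script: {:?}, lang: {:?}", info.script(), info.lang());
    }

    // Kana characters are strong evidence of Japanese.
    if let Some(info) = detect("東京大学に行きます") {
        println!("script: {:?}, lang: {:?}", info.script(), info.lang());
    }
}
```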
-
Handling of Proper Nouns in Japanese
Is the issue of not being able to search for proper nouns that are not in ipadic already being discussed, like the Chinese language support, etc.? ref: misskey-dev/misskey/issues/10845

Target Contents

Search Word:
-
Hello everyone 👋 An update on Meilisearch and the Japanese support

New release V1.3 🦁
v1.3 has been released today 🦁 including a change in the Japanese segmentation: Meilisearch now relies on UniDic instead of IPADIC to segment Japanese words, which should increase the number of documents retrieved by Meilisearch. We still encounter difficulties when a dataset contains small documents with kanji-only fields; if you don't manage to retrieve documents containing kanji-only fields with the official Meilisearch version, please try the Japanese specialized docker image that deactivates other Language support.

A preview of V1.4 👀
We just released a 🧪 prototype that allows users to customize how Meilisearch tokenizes documents and queries, and we'd love your feedback.
How to get the prototype?
Using docker, use the following command:
From source, compile Meilisearch on the

How to use the prototype?
You can find some examples below, or look at the original PR for more info. We know that the

Feedback and bug reporting when using this prototype are encouraged! Thanks in advance for your involvement. It means a lot to us ❤️
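For illustration, here is a hedged sketch of pushing tokenizer-related settings over the HTTP settings route; `dictionary` and `separatorTokens` are the settings that shipped with v1.4, while the host, index uid, key, and words below are made-up placeholders:

```rust
// Sketch only: updating custom tokenizer settings on a local Meilisearch.
// Assumes a v1.4+ instance on localhost:7700, an index named "products",
// and a master-key placeholder; the dictionary entries are illustrative.
use reqwest::blocking::Client;
use serde_json::json;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = Client::new();
    let settings = json!({
        // Words the tokenizer should keep together as single tokens.
        "dictionary": ["東京スカイツリー", "羽田空港"],
        // Characters to treat as hard separators between tokens.
        "separatorTokens": ["・"]
    });
    let response = client
        .patch("http://localhost:7700/indexes/products/settings")
        .header("Authorization", "Bearer MASTER_KEY")
        .json(&settings)
        .send()?;
    println!("{}", response.text()?);
    Ok(())
}
```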
-
Facet Search is not working as I expected in Japanese
I am trying Facet Search on a Japanese demo site I have published, but it doesn't seem to work the way I want it to. I am trying to narrow down by prefecture; the example of Osaka-fu is easy to understand. (Meilisearch version: prototype-japanese-5) If I type in
-
Hi! Thank you guys for your fantastic work on improving Japanese support. I tried the Docker image meilisearch/prototype-japanese-7, which works really well in my case (Drupal search API + Meilisearch backend). I have two questions:
My apologies if this is not the right place to ask. Thank you for any input!
-
Thanks for the great tools. My search query was '大学' and it is showing results for '大垣' as well, which doesn't make sense.
-
Hello!
-
Hello All,
If you want more information about the last release, I put the link to it below:

Below is the PR with all the Japanese Language-specialized docker images:

See you!
-
Specify a user dictionary for Japanese
I am impressed with the creation of such a wonderful search engine! 😀 Japanese does not separate words with spaces, so a good dictionary is necessary to determine word boundaries. Particularly for new words or proper nouns, registering words in a user dictionary may be required. Lindera has a feature to specify a user dictionary (see the sketch below). If this is appropriate, I would like to make a PR for it.
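For reference, a sketch of the kind of kuromoji-style "simple" user-dictionary CSV Lindera can load (surface form, part of speech, katakana reading); the exact column layout is version-dependent, so check the Lindera documentation, and the entries below are made up:

```rust
// Sketch only: writing a user-dictionary CSV that a Lindera-based
// tokenizer could be configured to load. Columns follow the simple
// format (surface, part of speech, reading); verify against your version.
const USER_DICT_CSV: &str = "\
東京スカイツリー,カスタム名詞,トウキョウスカイツリー
羽田空港,カスタム名詞,ハネダクウコウ
";

fn main() {
    std::fs::write("userdic.csv", USER_DICT_CSV).expect("failed to write userdic.csv");
    println!("wrote userdic.csv");
}
```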
-
Thanks for the great support & great software. We started using

Is there any way to get back these features?
-
For those who try Meilisearch for the first time: you should try v1.10.2 (or any later versions) with
-
I would like to enable Lindera's character filters and token filters in Charabia. I am thinking that if I don't build the Charabia token from values other than the text recorded in Lindera's token, the term will be out of position in highlighting, etc. (see the sketch below). I made it possible to describe the Lindera settings in YAML; we would be very happy if this could be accomplished, as it would allow Japanese-specific string handling to be configured from outside Meilisearch using an environment variable. Is there a better way to do this?
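To illustrate the highlighting concern, here is a minimal sketch of why a token must keep offsets into the original text even when filters rewrite its lemma; the `Token` struct is a hypothetical stand-in, not Charabia's actual type:

```rust
// Hypothetical token shape: filters may rewrite `lemma`, but the byte
// offsets must keep pointing into the *original* text, otherwise
// highlighted terms drift out of position.
struct Token<'a> {
    lemma: String,     // filtered/normalized form used for matching
    original: &'a str, // untouched slice of the source text
    byte_start: usize, // start offset in the source text
    byte_end: usize,   // end offset in the source text
}

fn highlight(text: &str, token: &Token) -> String {
    // Highlight with the original offsets, regardless of the lemma.
    format!(
        "{}<em>{}</em>{}",
        &text[..token.byte_start],
        &text[token.byte_start..token.byte_end],
        &text[token.byte_end..]
    )
}
```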
-
Japanese Language support
Current behavior, pointed-out issues, and possible enhancements
Language Detection
Current behavior
Meilisearch Language detection is handled by an external library named whatlang. Then, depending on the detected Script and Language, a specialized segmenter and specialized Normalizers are chosen to tokenize the provided text.
related to:
Possible enhancement
Segmentation
Meilisearch Japanese Segmentation is handled by an external library named lindera.
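To make concrete why a dictionary is needed at all, here is a toy longest-match segmenter; Lindera actually builds a Viterbi lattice over real dictionaries (IPADIC/UniDic), so treat this purely as an illustration with a made-up dictionary:

```rust
// Toy longest-match segmentation over a tiny hard-coded dictionary.
// It shows why word boundaries need a dictionary in the first place:
// Japanese text has no spaces to split on.
fn segment(text: &str, dict: &[&str]) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    let mut tokens = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        // Find the longest dictionary entry starting at position i.
        let mut next = None;
        for j in (i + 1..=chars.len()).rev() {
            let candidate: String = chars[i..j].iter().collect();
            if dict.contains(&candidate.as_str()) {
                next = Some((candidate, j));
                break;
            }
        }
        match next {
            Some((word, j)) => {
                tokens.push(word);
                i = j;
            }
            None => {
                // Unknown character: emit it as a single-char token.
                tokens.push(chars[i].to_string());
                i += 1;
            }
        }
    }
    tokens
}

fn main() {
    let dict = ["東京", "大学", "東京大学", "に", "行く"];
    assert_eq!(segment("東京大学に行く", &dict), vec!["東京大学", "に", "行く"]);
}
```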
Normalization
Currently, there is no specialized normalization for Japanese.
Possible enhancement
We could normalize Japanese words by converting them into `Hiragana`; this could increase the recall of Meilisearch because:
- `Katakana` and `Kanji` characters are written in `Hiragana` by the user, then the computer will suggest a `Katakana` or a `Kanji` version of the written text.
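As a sketch of what such a conversion could look like, assuming only the fixed code-point offset between the katakana block (U+30A1..U+30F6) and the hiragana block (U+3041..U+3096); this is not Meilisearch's actual normalizer:

```rust
// Fold katakana to hiragana by shifting code points down by 0x60.
// Characters outside the mapped range (e.g. the prolonged sound mark ー)
// are left untouched.
fn katakana_to_hiragana(input: &str) -> String {
    input
        .chars()
        .map(|c| match c {
            'ァ'..='ヶ' => char::from_u32(c as u32 - 0x60).unwrap_or(c),
            _ => c,
        })
        .collect()
}

fn main() {
    assert_eq!(katakana_to_hiragana("メイリサーチ"), "めいりさーち");
}
```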
Troubleshooting 🆘

A Query containing Kanjis doesn't retrieve all the relevant documents

When doing a search query with only `Kanji` characters, the language detection doesn't classify the query as a Japanese one but as a Chinese one because:
- `Kanji` is a set of traditional Chinese characters used in Japanese, and some are used in both Languages

Workaround
The only workaround is to use a specialized Meilisearch version that deactivates the Chinese Language support. Below is the link to the PR containing all the released versions:
meilisearch/meilisearch#3882
Possible fixes
Contribute!
In Meilisearch, we don't speak or understand all the Languages in the world, so we could be wrong in our interpretation of how to support a new Language in order to provide a relevant search experience.
However, if you are a native speaker, don't hesitate to contribute to enhancing this experience:
Thanks for your help!