Japanese Language support #532
Replies: 18 comments 52 replies
-
@ManyTheFish Thank you for writing the details.
It's fine. This is enough for Japanese to understand 👍

Language detection only with Kanji/Hanzi strings
I researched various things to see if whatlang could handle it, but it might be better not to expect too much. I'm just thinking it might be better to think of another way.

Normalization
I think Unicode NFKC is enough for Japanese normalization. (e.g. https://github.com/unicode-rs/unicode-normalization; see the sketch at the end of this comment.) It converts...

Indexing
I am very happy to be able to do ambiguous searches in hiragana, so I agree. Since Lindera stores the pronunciation in Katakana, I feel that it can be achieved by adapting the indexing.

About the Japanese input method
This is a supplement in the hope that it will be of some help.
There are two input methods for Japanese: "Romaji" and "Kana".
These can be switched via an option in the Japanese IME. Most Japanese users choose romaji input because the keyboard layout is easier to learn.
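As a concrete illustration of the NFKC point above, here is a minimal sketch using the unicode-normalization crate mentioned earlier (the input string is made up):

```rust
// NFKC folds full-width Latin letters and digits to ASCII, and recomposes
// half-width katakana with their voiced sound marks.
use unicode_normalization::UnicodeNormalization;

fn main() {
    let input = "ＭＥＩＬＩ１２３ ｶﾞｷﾞｸﾞ";
    let normalized: String = input.nfkc().collect();
    assert_eq!(normalized, "MEILI123 ガギグ");
    println!("{normalized}");
}
```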
-
Hello all!
All these issues are open to external contributions during the whole month, so don't hesitate to contribute! 🧑💻 This is another step in enhancing Japanese Language support; depending on future feedback, we will be able to go further. Thanks for all your feedback! ✍️ 🇯🇵
-
Hi all, I also put here the comment I wrote on meilisearch/charabia#139. This character normalization seems to be performed after tokenization, but in some cases it is better to perform character normalization before tokenization in Japanese. For example, this is a case where there is no problem even after tokenization:

Half-width

But the following cases can be problematic.

Since full-width numbers are already registered in the morphological dictionary (IPADIC), each number becomes a single token, so a full-width

If possible, I would like you to consider a way to perform character normalization before tokenization.
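A minimal sketch of the proposed ordering, with a `segment` callback standing in as a hypothetical placeholder for the real tokenizer (the actual Charabia pipeline differs):

```rust
// Run NFKC *before* segmentation so the dictionary lookup sees the
// normalized form; "１２３" becomes "123" before any token is produced,
// so a number-grouping rule can emit one token instead of three.
use unicode_normalization::UnicodeNormalization;

fn normalize_then_segment<F>(text: &str, segment: F) -> Vec<String>
where
    F: Fn(&str) -> Vec<String>,
{
    let normalized: String = text.nfkc().collect();
    segment(&normalized)
}
```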
-
@ManyTheFish I hope it will be of some help to you. https://speakerdeck.com/mosuka/the-importance-of-morphological-analysis-in-japanese-search-engines
-
I have published a simple application that I made to confirm that Meilisearch works in Japanese.
-
Hello people!

The current behavior

Language Detection
Today, we are using whatlang-rs to detect the Script and the Language in a text. Language detection is really important for Japanese Language support, mainly to make the difference with the Chinese Language when only Kanjis are used in a text, for example a small title or a query (see the whatlang sketch at the end of this comment).

Segmentation
To segment Japanese text, we are currently using Lindera, based on a Viterbi algorithm using the IPA dictionaries. Thanks to @mosuka for maintaining it. (a small explanation of Japanese segmentation)

Normalization
So far, we only normalize Japanese characters by replacing them with their decomposed compatible form; to give an example, half-width kanas are converted into kanas. To know more about this, I put some documentation about it below:

The remaining issues we should tackle in the future
Prototypes
There is a prototype of Meilisearch that completely deactivates the Chinese support; this way we avoid Language detection mistakes. In addition, this prototype activates the katakana-to-hiragana conversion. If you want to try this prototype, I put the link to it:

Thanks!
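To make the kanji-only ambiguity concrete, here is a minimal sketch calling the whatlang crate directly; which `Lang` comes back can vary with the whatlang version and the input, so treat the printed results as illustrative:

```rust
use whatlang::detect;

fn main() {
    // A kanji-only string contains no kana, so script detection sees only
    // Han characters and the language guess tends toward Chinese.
    if let Some(info) = detect("東京大学") {
        println!("script: {:?}, lang: {:?}", info.script(), info.lang());
    }

    // Kana characters are strong evidence of Japanese.
    if let Some(info) = detect("東京大学に行きます") {
        println!("script: {:?}, lang: {:?}", info.script(), info.lang());
    }
}
```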
-
Handling of Proper Nouns in Japanese
Is the issue of not being able to search for proper nouns that are not in ipadic already being discussed, like the Chinese language support, etc.? ref: misskey-dev/misskey/issues/10845

Target Contents

Search Word:
-
Hello everyone 👋 An update on Meilisearch and the Japanese support

New release V1.3 🦁
v1.3 has been released today 🦁 including a change in the Japanese segmentation: Meilisearch now relies on UniDic instead of IPADIC to segment Japanese words, which should increase the number of documents retrieved by Meilisearch. We still encounter difficulties when a dataset contains small documents with kanji-only fields; if you don't manage to retrieve documents containing kanji-only fields with the official Meilisearch version, please try the Japanese specialized docker image that deactivates other Language support.

A preview of V1.4 👀
We just released a 🧪 prototype that allows users to customize how Meilisearch tokenizes documents and queries, and we'd love your feedback.
How to get the prototype?
Using docker, use the following command:
From source, compile Meilisearch on the

How to use the prototype?
You can find some examples below, or look at the original PR for more info. We know that the

Feedback and bug reporting when using this prototype are encouraged! Thanks in advance for your involvement. It means a lot to us ❤️
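For illustration, here is a hedged sketch of pushing tokenizer-related settings over the HTTP settings route; `dictionary` and `separatorTokens` are the settings that shipped with v1.4, while the host, index uid, key, and words below are made-up placeholders:

```rust
// Sketch only: updating custom tokenizer settings on a local Meilisearch.
// Assumes a v1.4+ instance on localhost:7700, an index named "products",
// and a master-key placeholder; the dictionary entries are illustrative.
use reqwest::blocking::Client;
use serde_json::json;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = Client::new();
    let settings = json!({
        // Words the tokenizer should keep together as single tokens.
        "dictionary": ["東京スカイツリー", "羽田空港"],
        // Characters to treat as hard separators between tokens.
        "separatorTokens": ["・"]
    });
    let response = client
        .patch("http://localhost:7700/indexes/products/settings")
        .header("Authorization", "Bearer MASTER_KEY")
        .json(&settings)
        .send()?;
    println!("{}", response.text()?);
    Ok(())
}
```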
-
Facet Search is not working as I expected in Japanese
I am trying Facet Search on a Japanese demo site I have published, but it doesn't seem to work the way I want it to. I am trying to narrow down by prefecture; the example of Osaka-fu is easy to understand. (Meilisearch version: prototype-japanese-5) If I type in
-
Hi! Thank you guys for your fantastic work on improving Japanese support. I tried the Docker image meilisearch/prototype-japanese-7, which works really well in my case (Drupal search API + Meilisearch backend). I have two questions:
My apologies if this is not the right place to ask. Thank you for any input!
-
Thanks for the great tools. My search query was '大学' and it is showing results for '大垣' as well, which doesn't make sense.
-
Hello!
-
Hello All,
If you want more information about the last release, I put the link to it below:

Below is the PR with all the Japanese Language-specialized docker images:

See you!
-
Specify a user dictionary for Japanese
I am impressed with the creation of such a wonderful search engine! 😀 Japanese does not separate words with spaces, so a good dictionary is necessary to determine word boundaries. Particularly for new words or proper nouns, registering words in a user dictionary may be required. Lindera has a feature to specify a user dictionary (see the sketch below). If this is appropriate, I would like to make a PR for it.
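For reference, a sketch of the kind of kuromoji-style "simple" user-dictionary CSV Lindera can load (surface form, part of speech, katakana reading); the exact column layout is version-dependent, so check the Lindera documentation, and the entries below are made up:

```rust
// Sketch only: writing a user-dictionary CSV that a Lindera-based
// tokenizer could be configured to load. Columns follow the simple
// format (surface, part of speech, reading); verify against your version.
const USER_DICT_CSV: &str = "\
東京スカイツリー,カスタム名詞,トウキョウスカイツリー
羽田空港,カスタム名詞,ハネダクウコウ
";

fn main() {
    std::fs::write("userdic.csv", USER_DICT_CSV).expect("failed to write userdic.csv");
    println!("wrote userdic.csv");
}
```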
-
Thanks for the great support & great software. We started using

Is there any way to get back these features?
-
For those who try Meilisearch for the first time: you should try v1.10.2 (or any later versions) with
-
I would like to enable Lindera's character filters and token filters in Charabia. I am thinking that if I don't build the Charabia token from values other than the text recorded in Lindera's token, the term will be out of position in highlighting, etc. (see the sketch below). I made it possible to describe the Lindera settings in YAML; we would be very happy if this could be accomplished, as it would allow Japanese-specific string handling to be configured from outside Meilisearch using an environment variable. Is there a better way to do this?
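To illustrate the highlighting concern, here is a minimal sketch of why a token must keep offsets into the original text even when filters rewrite its lemma; the `Token` struct is a hypothetical stand-in, not Charabia's actual type:

```rust
// Hypothetical token shape: filters may rewrite `lemma`, but the byte
// offsets must keep pointing into the *original* text, otherwise
// highlighted terms drift out of position.
struct Token<'a> {
    lemma: String,     // filtered/normalized form used for matching
    original: &'a str, // untouched slice of the source text
    byte_start: usize, // start offset in the source text
    byte_end: usize,   // end offset in the source text
}

fn highlight(text: &str, token: &Token) -> String {
    // Highlight with the original offsets, regardless of the lemma.
    format!(
        "{}<em>{}</em>{}",
        &text[..token.byte_start],
        &text[token.byte_start..token.byte_end],
        &text[token.byte_end..]
    )
}
```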
-
Japanese Language support
Current behavior, pointed-out issues, and possible enhancements
Language Detection
Current behavior
Meilisearch Language detection is handled by an external library named whatlang. Then, depending on the detected Script and Language, a specialized segmenter and specialized Normalizers are chosen to tokenize the provided text.
related to:
Possible enhancement
Segmentation
Meilisearch Japanese Segmentation is handled by an external library named lindera.
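To make concrete why a dictionary is needed at all, here is a toy longest-match segmenter; Lindera actually builds a Viterbi lattice over real dictionaries (IPADIC/UniDic), so treat this purely as an illustration with a made-up dictionary:

```rust
// Toy longest-match segmentation over a tiny hard-coded dictionary.
// It shows why word boundaries need a dictionary in the first place:
// Japanese text has no spaces to split on.
fn segment(text: &str, dict: &[&str]) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    let mut tokens = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        // Find the longest dictionary entry starting at position i.
        let mut next = None;
        for j in (i + 1..=chars.len()).rev() {
            let candidate: String = chars[i..j].iter().collect();
            if dict.contains(&candidate.as_str()) {
                next = Some((candidate, j));
                break;
            }
        }
        match next {
            Some((word, j)) => {
                tokens.push(word);
                i = j;
            }
            None => {
                // Unknown character: emit it as a single-char token.
                tokens.push(chars[i].to_string());
                i += 1;
            }
        }
    }
    tokens
}

fn main() {
    let dict = ["東京", "大学", "東京大学", "に", "行く"];
    assert_eq!(segment("東京大学に行く", &dict), vec!["東京大学", "に", "行く"]);
}
```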
Normalization
Currently, there is no specialized normalization for Japanese.
Possible enhancement
We could normalize Japanese words by converting them into `Hiragana`; this could increase the recall of Meilisearch because:
- `Katakana` and `Kanji` characters are written in `Hiragana` by the user, then the computer will suggest a `Katakana` or a `Kanji` version of the written text.
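As a sketch of what such a conversion could look like, assuming only the fixed code-point offset between the katakana block (U+30A1..U+30F6) and the hiragana block (U+3041..U+3096); this is not Meilisearch's actual normalizer:

```rust
// Fold katakana to hiragana by shifting code points down by 0x60.
// Characters outside the mapped range (e.g. the prolonged sound mark ー)
// are left untouched.
fn katakana_to_hiragana(input: &str) -> String {
    input
        .chars()
        .map(|c| match c {
            'ァ'..='ヶ' => char::from_u32(c as u32 - 0x60).unwrap_or(c),
            _ => c,
        })
        .collect()
}

fn main() {
    assert_eq!(katakana_to_hiragana("メイリサーチ"), "めいりさーち");
}
```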
Troubleshooting 🆘

A Query containing Kanjis doesn't retrieve all the relevant documents

When doing a search query with only `Kanji` characters, the language detection doesn't classify the query as a Japanese one but as a Chinese one because:
- `Kanji` is a set of traditional Chinese characters used in Japanese, and some are used in both Languages

Workaround
The only workaround is to use a specialized Meilisearch version that deactivates the Chinese Language support. Below is the link to the PR containing all the released versions:
meilisearch/meilisearch#3882
Possible fixes
Contribute!
In Meilisearch, we don't speak or understand all the Languages in the world, so we could be wrong in our interpretation of how to support a new Language in order to provide a relevant search experience.
However, if you are a native speaker, don't hesitate to contribute to enhancing this experience:
Thanks for your help!