From d6312cfbec8c2352d4b69bd759d04af7f3e497f0 Mon Sep 17 00:00:00 2001 From: Tommy Melvin Date: Tue, 14 Jun 2022 18:56:44 +0200 Subject: [PATCH 1/9] Rewrite language.md for hebrew --- learn/what_is_meilisearch/language.md | 32 +++++++++++++-------------- 1 file changed, 16 insertions(+), 16 deletions(-) diff --git a/learn/what_is_meilisearch/language.md b/learn/what_is_meilisearch/language.md index 38c9bf3e56..125b559555 100644 --- a/learn/what_is_meilisearch/language.md +++ b/learn/what_is_meilisearch/language.md @@ -1,42 +1,42 @@ # Language -**Meilisearch is multilingual**, featuring optimized support for: +Meilisearch is multilingual, featuring optimized support for: - **Any language that uses whitespace to separate words** - **Chinese** (through [Jieba](https://github.com/messense/jieba-rs)) - **Japanese** (through [Lindera](https://github.com/lindera-morphology/lindera)) +- **Hebrew** (with normalization through [Niqqud](https://docs.rs/niqqud/latest/niqqud/)) -We aim to provide global language support, and your feedback helps us [move closer to that goal](#improving-our-language-support). If you notice inconsistencies in your search results or the way your documents are processed, please open an issue on our [GitHub repository](https://github.com/meilisearch/meilisearch/issues/new/choose). +We aim to provide global language support, and your feedback helps us move closer to that goal. If you notice inconsistencies in your search results or the way your documents are processed, please [open an issue in our tokenizer repo](https://github.com/meilisearch/charabia/issues/new). -If you'd like to learn more about how different languages are processed in Meilisearch, see our [tokenizer documentation](/learn/advanced/tokenization.md). +[Read more about our tokenizer](/learn/advanced/tokenization.md) ## Improving our language support -While we have employees from all over the world at Meilisearch, we don't speak every language. 
In fact, we rely almost entirely on feedback from external contributors to know how our engine is performing across different languages. +While we have employees from all over the world at Meilisearch, we don't speak every language. We rely almost entirely on feedback from external contributors to understand how our engine is performing across different languages. -If you'd like to help us create a more global Meilisearch, please consider sharing your tests, results, and general feedback with us through [GitHub issues](https://github.com/meilisearch/Meilisearch/issues). Here are some of the languages that have been requested by users and their corresponding issue: - -- [Arabic](https://github.com/meilisearch/meilisearch/issues/554) -- [Lao](https://github.com/meilisearch/meilisearch/issues/563) -- [Persian/Farsi](https://github.com/meilisearch/meilisearch/issues/553) -- [Thai](https://github.com/meilisearch/meilisearch/issues/864) - -If you'd like us to add or improve support for a language that isn't in the above list, please create an [issue](https://github.com/meilisearch/meilisearch/issues/new?assignees=&labels=&template=feature_request.md&title=) saying so, and then make a [pull request on the documentation](https://github.com/meilisearch/documentation/edit/master/reference/features/language.md) to add it to the above list. +- If you'd like to request dedicated support for a language but aren't able to work on a tokenization pipeline yourself, please [open a discussion in our product repo](https://github.com/meilisearch/product/discussions). +- If you are interested in contributing to the Meilisearch tokenizer directly, please have a look at the [contribution guide](https://github.com/meilisearch/charabia/blob/main/CONTRIBUTING.md) before doing so. ## FAQ ### What do you mean when you say Meilisearch offers _optimized_ support for a language? 
-Under the hood, Meilisearch relies on tokenizers that identify the most important parts of each document in a given dataset. We currently use two tokenization pipelines: one for languages that separate words with spaces and one specifically tailored for Chinese. Languages that delimit their words in other ways will still work, but the quality and relevancy of search results may vary significantly. +Under the hood, Meilisearch relies on tokenizers that identify the most important parts of each document in a given dataset. We currently use four tokenization pipelines: + +- A default one designed for languages that separate words with spaces +- One specifically tailored for Chinese +- One specifically tailored for Japanese +- One specifically tailored for Hebrew ### My language does not use whitespace to separate words. Can I still use Meilisearch? -Yes, but your experience might not be optimized and results might be less relevant than in whitespace-separated languages and Chinese. +Yes, but search results might be less relevant than in one of the fully optimized languages. ### My language does not use the Roman alphabet. Can I still use Meilisearch? -Yes—our users work with many different alphabets and writing systems such as Cyrillic, Thai, and Japanese. +Yes—our users work with many different alphabets and writing systems, such as Cyrillic, Thai, and Japanese. ### Does Meilisearch plan to support additional languages in the future? -Yes, we definitely do. The more feedback we get from native speakers, the easier it is for us to understand how to improve performance for those languages—and the more requests to improve support for a specific language, the more likely we are to devote resources to that project. +Yes, we definitely do. The more feedback we get from native speakers, the easier it is for us to understand how to improve performance for those languages. 
Similarly, the more requests we get to improve support for a specific language, the more likely we are to devote resources to that project. From 85227a7a934961b6448184772b4773b1232e5971 Mon Sep 17 00:00:00 2001 From: Tommy Melvin Date: Wed, 15 Jun 2022 16:31:34 +0200 Subject: [PATCH 2/9] Update tokenization page (four total pipelines) Also reduce technical complexity of language page (no need to mention specific normalizers + segmenters) --- learn/advanced/tokenization.md | 8 +++++--- learn/what_is_meilisearch/language.md | 11 ++++++----- 2 files changed, 11 insertions(+), 8 deletions(-) diff --git a/learn/advanced/tokenization.md b/learn/advanced/tokenization.md index acf63d63be..3a17cdcba2 100644 --- a/learn/advanced/tokenization.md +++ b/learn/advanced/tokenization.md @@ -17,9 +17,11 @@ We can break down the tokenization process like so: 1. Crawl the document(s) and determine the primary language for each field 2. Go back over the documents field-by-field, running the corresponding tokenization pipeline, if it exists -Pipelines include many language-specific operations. Currently, we have two pipelines: +Pipelines include many language-specific operations. Currently, we have four pipelines: -1. A specialized Chinese pipeline using [Jieba](https://github.com/messense/jieba-rs) -2. A default Meilisearch pipeline that separates words based on categories. Works with a variety of languages +1. A default Meilisearch pipeline for languages that use whitespace to separate words. Uses [unicode segmenter](https://github.com/unicode-rs/unicode-segmentation) +2. A specialized Chinese pipeline using [Jieba](https://github.com/messense/jieba-rs) +3. A specialized Japanese pipeline using [Lindera](https://github.com/lindera-morphology/lindera) +4. A specialized Hebrew pipeline based off the default Meilisearch pipeline.
Uses [Niqqud](https://docs.rs/niqqud/latest/niqqud/) for normalization For more details, check out the [feature specification](https://github.com/meilisearch/specifications/blob/master/text/0001-script-based-tokenizer.md). diff --git a/learn/what_is_meilisearch/language.md b/learn/what_is_meilisearch/language.md index 125b559555..5e7521d818 100644 --- a/learn/what_is_meilisearch/language.md +++ b/learn/what_is_meilisearch/language.md @@ -3,9 +3,9 @@ Meilisearch is multilingual, featuring optimized support for: - **Any language that uses whitespace to separate words** -- **Chinese** (through [Jieba](https://github.com/messense/jieba-rs)) -- **Japanese** (through [Lindera](https://github.com/lindera-morphology/lindera)) -- **Hebrew** (with normalization through [Niqqud](https://docs.rs/niqqud/latest/niqqud/)) +- **Chinese** +- **Japanese** +- **Hebrew** We aim to provide global language support, and your feedback helps us move closer to that goal. If you notice inconsistencies in your search results or the way your documents are processed, please [open an issue in our tokenizer repo](https://github.com/meilisearch/charabia/issues/new). @@ -15,8 +15,9 @@ We aim to provide global language support, and your feedback helps us move close While we have employees from all over the world at Meilisearch, we don't speak every language. We rely almost entirely on feedback from external contributors to understand how our engine is performing across different languages. -- If you'd like to request dedicated support for a language but aren't able to work on a tokenization pipeline yourself, please [open a discussion in our product repo](https://github.com/meilisearch/product/discussions). -- If you are interested in contributing to the Meilisearch tokenizer directly, please have a look at the [contribution guide](https://github.com/meilisearch/charabia/blob/main/CONTRIBUTING.md) before doing so. 
+If you'd like to request optimized support for a language that we don't currently support, please [open a discussion in our product repository](https://github.com/meilisearch/product/discussions). + +If you'd like to help by developing a tokenizer pipeline yourself: first of all, thank you! We recommend that you take a look at the [tokenizer contribution guide](https://github.com/meilisearch/charabia/blob/main/CONTRIBUTING.md) before making a PR. ## FAQ From ab5c36a688113c71bd4e425582401344553f48e3 Mon Sep 17 00:00:00 2001 From: Tommy Melvin Date: Thu, 23 Jun 2022 18:41:53 +0200 Subject: [PATCH 3/9] Be less vague Co-authored w/ Gui Machiavelli --- learn/what_is_meilisearch/language.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/learn/what_is_meilisearch/language.md b/learn/what_is_meilisearch/language.md index 5e7521d818..266abcc059 100644 --- a/learn/what_is_meilisearch/language.md +++ b/learn/what_is_meilisearch/language.md @@ -25,10 +25,10 @@ If you'd like to help by developing a tokenizer pipeline yourself: first of all, Under the hood, Meilisearch relies on tokenizers that identify the most important parts of each document in a given dataset. We currently use four tokenization pipelines: -- A default one designed for languages that separate words with spaces -- One specifically tailored for Chinese -- One specifically tailored for Japanese -- One specifically tailored for Hebrew +- A default pipeline designed for languages that separate words with spaces +- A pipeline specifically tailored for Chinese +- A pipeline specifically tailored for Japanese +- A pipeline specifically tailored for Hebrew ### My language does not use whitespace to separate words. Can I still use Meilisearch? 
From e69848edfd9a8c0cb5413887926a29888905723f Mon Sep 17 00:00:00 2001 From: Tommy Melvin Date: Thu, 23 Jun 2022 18:44:29 +0200 Subject: [PATCH 4/9] Remove unnecessary bold --- learn/what_is_meilisearch/language.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/learn/what_is_meilisearch/language.md b/learn/what_is_meilisearch/language.md index 266abcc059..e6ce5c5459 100644 --- a/learn/what_is_meilisearch/language.md +++ b/learn/what_is_meilisearch/language.md @@ -2,10 +2,10 @@ Meilisearch is multilingual, featuring optimized support for: -- **Any language that uses whitespace to separate words** -- **Chinese** -- **Japanese** -- **Hebrew** +- Any language that uses whitespace to separate words +- Chinese +- Japanese +- Hebrew We aim to provide global language support, and your feedback helps us move closer to that goal. If you notice inconsistencies in your search results or the way your documents are processed, please [open an issue in our tokenizer repo](https://github.com/meilisearch/charabia/issues/new). From 943d1afa6b0e63477c425b4729a087461b5676dc Mon Sep 17 00:00:00 2001 From: Tommy Melvin Date: Thu, 23 Jun 2022 18:46:54 +0200 Subject: [PATCH 5/9] Remove outdated tokenizer image + link to contributing.md --- learn/advanced/tokenization.md | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/learn/advanced/tokenization.md b/learn/advanced/tokenization.md index 3a17cdcba2..67d5b4d7d7 100644 --- a/learn/advanced/tokenization.md +++ b/learn/advanced/tokenization.md @@ -8,8 +8,6 @@ This allows Meilisearch to function in several different languages with zero set ## Deep dive: The Meilisearch tokenizer -![Chart illustrating the architecture of Meilisearch's tokenizer](https://user-images.githubusercontent.com/6482087/102896344-8560d200-4466-11eb-8cfe-b4ae8741093b.jpg) - When you add documents to a Meilisearch index, the tokenization process is handled by an abstract interface called an **analyzer**. 
The analyzer is responsible for determining the primary language of each field based on the scripts (e.g., Latin alphabet, Chinese hanzi, etc.) that are present there. Then, it applies the corresponding **pipeline** to each field. We can break down the tokenization process like so: @@ -24,4 +22,4 @@ Pipelines include many language-specific operations. Currently, we have four pip 3. A specialized Japanese pipeline using [Lindera](https://github.com/lindera-morphology/lindera) 4. A specialized Hebrew pipeline based off the default Meilisearch pipeline. Uses [Niqqud](https://docs.rs/niqqud/latest/niqqud/) for normalization -For more details, check out the [feature specification](https://github.com/meilisearch/specifications/blob/master/text/0001-script-based-tokenizer.md). +For more details, check out the [tokenizer contribution guide](https://github.com/meilisearch/charabia/blob/main/CONTRIBUTING.md). From dc61d8ff1ac7be3037412e436a97b6dd94b562a2 Mon Sep 17 00:00:00 2001 From: Maryam <90181761+maryamsulemani97@users.noreply.github.com> Date: Tue, 5 Jul 2022 16:32:34 +0400 Subject: [PATCH 6/9] Apply suggestions from code review Co-authored-by: Many the fish --- learn/what_is_meilisearch/language.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/learn/what_is_meilisearch/language.md b/learn/what_is_meilisearch/language.md index e6ce5c5459..99191d9d66 100644 --- a/learn/what_is_meilisearch/language.md +++ b/learn/what_is_meilisearch/language.md @@ -15,7 +15,7 @@ We aim to provide global language support, and your feedback helps us move close While we have employees from all over the world at Meilisearch, we don't speak every language. We rely almost entirely on feedback from external contributors to understand how our engine is performing across different languages. 
-If you'd like to request optimized support for a language that we don't currently support, please [open a discussion in our product repository](https://github.com/meilisearch/product/discussions). +If you'd like to request optimized support for a language that we don't currently support, please upvote the related [discussion in our product repository](https://github.com/meilisearch/product/discussions?discussions_q=label%3Aproduct%3Acore%3Atokenizer) or [open a new one](https://github.com/meilisearch/product/discussions/new) if it doesn't exist. If you'd like to help by developing a tokenizer pipeline yourself: first of all, thank you! We recommend that you take a look at the [tokenizer contribution guide](https://github.com/meilisearch/charabia/blob/main/CONTRIBUTING.md) before making a PR. From 0425a2230b259987e7ae0e56c82badf731984c7e Mon Sep 17 00:00:00 2001 From: Maryam Sulemani Date: Tue, 5 Jul 2022 16:57:03 +0400 Subject: [PATCH 7/9] update based on ManyTheFish's review --- learn/advanced/tokenization.md | 6 +++--- learn/what_is_meilisearch/language.md | 2 +- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/learn/advanced/tokenization.md b/learn/advanced/tokenization.md index 67d5b4d7d7..cf9db7c695 100644 --- a/learn/advanced/tokenization.md +++ b/learn/advanced/tokenization.md @@ -8,12 +8,12 @@ This allows Meilisearch to function in several different languages with zero set ## Deep dive: The Meilisearch tokenizer -When you add documents to a Meilisearch index, the tokenization process is handled by an abstract interface called an **analyzer**. The analyzer is responsible for determining the primary language of each field based on the scripts (e.g., Latin alphabet, Chinese hanzi, etc.) that are present there. Then, it applies the corresponding **pipeline** to each field. +When you add documents to a Meilisearch index, the tokenization process is handled by an abstract interface called the tokenizer. 
The tokenizer is responsible for splitting each field by script (e.g., Latin alphabet, Chinese hanzi, etc.). It then applies the corresponding pipeline to each part of each field. We can break down the tokenization process like so: -1. Crawl the document(s) and determine the primary language for each field -2. Go back over the documents field-by-field, running the corresponding tokenization pipeline, if it exists +1. Crawl the document(s), splitting each field by script +2. Go back over the documents part-by-part, running the corresponding tokenization pipeline, if it exists Pipelines include many language-specific operations. Currently, we have four pipelines: diff --git a/learn/what_is_meilisearch/language.md b/learn/what_is_meilisearch/language.md index 99191d9d66..a515fd0c9b 100644 --- a/learn/what_is_meilisearch/language.md +++ b/learn/what_is_meilisearch/language.md @@ -40,4 +40,4 @@ Yes—our users work with many different alphabets and writing systems, such as ### Does Meilisearch plan to support additional languages in the future? -Yes, we definitely do. The more feedback we get from native speakers, the easier it is for us to understand how to improve performance for those languages. Similarly, the more requests we get to improve support for a specific language, the more likely we are to devote resources to that project. +Yes, we definitely do. The more [feedback](https://github.com/meilisearch/product/discussions?discussions_q=label%3Aproduct%3Acore%3Atokenizer) we get from native speakers, the easier it is for us to understand how to improve performance for those languages. Similarly, the more requests we get to improve support for a specific language, the more likely we are to devote resources to that project. 
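The two-step process patch 7 describes — split each field into script-contiguous parts, then run the matching pipeline part-by-part — can be sketched as a toy illustration. Python is used here for brevity; Meilisearch's actual tokenizer is the Rust charabia crate, and the script detection and per-script pipelines below are deliberately naive stand-ins, not its real API.

```python
from itertools import groupby
import unicodedata

def script_of(ch: str) -> str:
    """Very coarse script tag for one character (illustrative only)."""
    name = unicodedata.name(ch, "")
    if "CJK" in name:
        return "Han"
    if "HEBREW" in name:
        return "Hebrew"
    return "Default"

def tokenize(field: str) -> list[str]:
    tokens = []
    # Step 1: split the field into script-contiguous runs.
    for script, run in groupby(field, key=script_of):
        text = "".join(run)
        # Step 2: apply the pipeline for that run's script. The default
        # pipeline here just splits on whitespace; a real Chinese
        # pipeline would call a segmenter such as Jieba instead.
        if script == "Han":
            tokens.extend(text)  # naive: one token per hanzi
        else:
            tokens.extend(text.split())
    return tokens

print(tokenize("hello 世界 shalom"))
```

The point of the sketch is the dispatch structure: mixed-script fields are handled correctly because each run is segmented by the pipeline suited to its script, with everything else falling back to the whitespace default.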
From 29b9d16da42828960034b7d1860e75a3ba8a6a59 Mon Sep 17 00:00:00 2001 From: Maryam <90181761+maryamsulemani97@users.noreply.github.com> Date: Wed, 6 Jul 2022 16:54:51 +0400 Subject: [PATCH 8/9] Update learn/advanced/tokenization.md Co-authored-by: gui machiavelli --- learn/advanced/tokenization.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/learn/advanced/tokenization.md b/learn/advanced/tokenization.md index cf9db7c695..8811f136a5 100644 --- a/learn/advanced/tokenization.md +++ b/learn/advanced/tokenization.md @@ -8,7 +8,7 @@ This allows Meilisearch to function in several different languages with zero set ## Deep dive: The Meilisearch tokenizer -When you add documents to a Meilisearch index, the tokenization process is handled by an abstract interface called the tokenizer. The tokenizer is responsible for splitting each field by script (e.g., Latin alphabet, Chinese hanzi, etc.). It then applies the corresponding pipeline to each part of each field. +When you add documents to a Meilisearch index, the tokenization process is handled by an abstract interface called the tokenizer. The tokenizer is responsible for splitting each field by writing system (e.g. Latin alphabet, Chinese hanzi). It then applies the corresponding pipeline to each part of each document field. 
We can break down the tokenization process like so: From b07336f47dcc5062da2e3ae3c777368104a45e46 Mon Sep 17 00:00:00 2001 From: Maryam <90181761+maryamsulemani97@users.noreply.github.com> Date: Thu, 7 Jul 2022 15:48:49 +0400 Subject: [PATCH 9/9] Update learn/what_is_meilisearch/language.md Co-authored-by: Many the fish --- learn/what_is_meilisearch/language.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/learn/what_is_meilisearch/language.md b/learn/what_is_meilisearch/language.md index a515fd0c9b..a388d3fb7a 100644 --- a/learn/what_is_meilisearch/language.md +++ b/learn/what_is_meilisearch/language.md @@ -15,7 +15,7 @@ We aim to provide global language support, and your feedback helps us move close While we have employees from all over the world at Meilisearch, we don't speak every language. We rely almost entirely on feedback from external contributors to understand how our engine is performing across different languages. -If you'd like to request optimized support for a language that we don't currently support, please upvote the related [discussion in our product repository](https://github.com/meilisearch/product/discussions?discussions_q=label%3Aproduct%3Acore%3Atokenizer) or [open a new one](https://github.com/meilisearch/product/discussions/new) if it doesn't exist. +If you'd like to request optimized support for a language that we don't currently support, please upvote the related [discussion in our product repository](https://github.com/meilisearch/product/discussions?discussions_q=label%3Aproduct%3Acore%3Atokenizer) or [open a new one](https://github.com/meilisearch/product/discussions/new?category=feedback-feature-proposal) if it doesn't exist. If you'd like to help by developing a tokenizer pipeline yourself: first of all, thank you! We recommend that you take a look at the [tokenizer contribution guide](https://github.com/meilisearch/charabia/blob/main/CONTRIBUTING.md) before making a PR.
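As a closing illustration of the Hebrew normalization these patches reference: the niqqud crate strips vocalization points so that pointed and unpointed spellings of a word match at search time. A rough Python equivalent, using generic Unicode combining-mark removal rather than the actual crate, might look like this:

```python
import unicodedata

def strip_niqqud(text: str) -> str:
    # Niqqud vocalization points (e.g. qamats U+05B8, holam U+05B9)
    # are combining marks (Unicode category "Mn") in the Hebrew block.
    # Decomposing and dropping them leaves the bare consonantal text,
    # so a pointed spelling normalizes to its unpointed form.
    return "".join(
        ch for ch in unicodedata.normalize("NFD", text)
        if unicodedata.category(ch) != "Mn"
    )

print(strip_niqqud("שָׁלוֹם"))  # pointed "shalom" reduced to bare letters
```

This is an approximation for intuition only — it also strips combining marks in other scripts — whereas the pipeline the patches describe applies niqqud-specific normalization on top of the default segmentation.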