v0.28 hebrew tokenizer #1728
Merged
9 commits
- d6312cf (dichotommy): Rewrite language.md for hebrew
- 85227a7 (dichotommy): Update tokenization page (four total pipelines
- ab5c36a (dichotommy): Be less vague
- e69848e (dichotommy): Remove unnecessary bold
- 943d1af (dichotommy): Remove outdated tokenizer image + link to contributing.md
- dc61d8f (maryamsulemani97): Apply suggestions from code review
- 0425a22 (maryamsulemani97): update based on ManyTheFish's review
- 29b9d16 (maryamsulemani97): Update learn/advanced/tokenization.md
- b07336f (maryamsulemani97): Update learn/what_is_meilisearch/language.md
@@ -1,42 +1,43 @@
 # Language

-**Meilisearch is multilingual**, featuring optimized support for:
+Meilisearch is multilingual, featuring optimized support for:

-- **Any language that uses whitespace to separate words**
-- **Chinese** (through [Jieba](https://github.com/messense/jieba-rs))
-- **Japanese** (through [Lindera](https://github.com/lindera-morphology/lindera))
+- Any language that uses whitespace to separate words
+- Chinese
+- Japanese
+- Hebrew

-We aim to provide global language support, and your feedback helps us [move closer to that goal](#improving-our-language-support). If you notice inconsistencies in your search results or the way your documents are processed, please open an issue on our [GitHub repository](https://github.com/meilisearch/meilisearch/issues/new/choose).
+We aim to provide global language support, and your feedback helps us move closer to that goal. If you notice inconsistencies in your search results or the way your documents are processed, please [open an issue in our tokenizer repo](https://github.com/meilisearch/charabia/issues/new).

-If you'd like to learn more about how different languages are processed in Meilisearch, see our [tokenizer documentation](/learn/advanced/tokenization.md).
+[Read more about our tokenizer](/learn/advanced/tokenization.md)

 ## Improving our language support

-While we have employees from all over the world at Meilisearch, we don't speak every language. In fact, we rely almost entirely on feedback from external contributors to know how our engine is performing across different languages.
+While we have employees from all over the world at Meilisearch, we don't speak every language. We rely almost entirely on feedback from external contributors to understand how our engine is performing across different languages.

-If you'd like to help us create a more global Meilisearch, please consider sharing your tests, results, and general feedback with us through [GitHub issues](https://github.com/meilisearch/Meilisearch/issues). Here are some of the languages that have been requested by users and their corresponding issue:
+If you'd like to request optimized support for a language that we don't currently support, please upvote the related [discussion in our product repository](https://github.com/meilisearch/product/discussions?discussions_q=label%3Aproduct%3Acore%3Atokenizer) or [open a new one](https://github.com/meilisearch/product/discussions/new?category=feedback-feature-proposal) if it doesn't exist.

-- [Arabic](https://github.com/meilisearch/meilisearch/issues/554)
-- [Lao](https://github.com/meilisearch/meilisearch/issues/563)
-- [Persian/Farsi](https://github.com/meilisearch/meilisearch/issues/553)
-- [Thai](https://github.com/meilisearch/meilisearch/issues/864)
-
-If you'd like us to add or improve support for a language that isn't in the above list, please create an [issue](https://github.com/meilisearch/meilisearch/issues/new?assignees=&labels=&template=feature_request.md&title=) saying so, and then make a [pull request on the documentation](https://github.com/meilisearch/documentation/edit/master/reference/features/language.md) to add it to the above list.
+If you'd like to help by developing a tokenizer pipeline yourself: first of all, thank you! We recommend that you take a look at the [tokenizer contribution guide](https://github.com/meilisearch/charabia/blob/main/CONTRIBUTING.md) before making a PR.

 ## FAQ

 ### What do you mean when you say Meilisearch offers _optimized_ support for a language?

-Under the hood, Meilisearch relies on tokenizers that identify the most important parts of each document in a given dataset. We currently use two tokenization pipelines: one for languages that separate words with spaces and one specifically tailored for Chinese. Languages that delimit their words in other ways will still work, but the quality and relevancy of search results may vary significantly.
+Under the hood, Meilisearch relies on tokenizers that identify the most important parts of each document in a given dataset. We currently use four tokenization pipelines:

+- A default pipeline designed for languages that separate words with spaces
+- A pipeline specifically tailored for Chinese
+- A pipeline specifically tailored for Japanese
+- A pipeline specifically tailored for Hebrew
+
 ### My language does not use whitespace to separate words. Can I still use Meilisearch?

-Yes, but your experience might not be optimized and results might be less relevant than in whitespace-separated languages and Chinese.
+Yes, but search results might be less relevant than in one of the fully optimized languages.

 ### My language does not use the Roman alphabet. Can I still use Meilisearch?

-Yes—our users work with many different alphabets and writing systems such as Cyrillic, Thai, and Japanese.
+Yes—our users work with many different alphabets and writing systems, such as Cyrillic, Thai, and Japanese.

 ### Does Meilisearch plan to support additional languages in the future?

-Yes, we definitely do. The more feedback we get from native speakers, the easier it is for us to understand how to improve performance for those languages—and the more requests to improve support for a specific language, the more likely we are to devote resources to that project.
+Yes, we definitely do. The more [feedback](https://github.com/meilisearch/product/discussions?discussions_q=label%3Aproduct%3Acore%3Atokenizer) we get from native speakers, the easier it is for us to understand how to improve performance for those languages. Similarly, the more requests we get to improve support for a specific language, the more likely we are to devote resources to that project.
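
To make the FAQ's four-pipeline split more concrete, here is a minimal Python sketch of how text could be routed to one of four pipeline labels by looking at the Unicode script of its letters. It is an illustration only: `char_script`, `pick_pipeline`, `naive_tokens`, and the label strings are hypothetical names invented for this note, and the heuristic is an assumption, not the detection logic used by Meilisearch or its tokenizer. The last helper shows why text without spaces is poorly served by plain whitespace splitting.

```python
import unicodedata
from collections import Counter
from typing import List, Optional

# Hypothetical labels mirroring the four pipelines named in the FAQ.
# They are illustrative strings, not identifiers from Meilisearch.
DEFAULT, CHINESE, JAPANESE, HEBREW = "default", "chinese", "japanese", "hebrew"


def char_script(ch: str) -> Optional[str]:
    """Return a coarse script label for one character, or None for non-letters."""
    if not ch.isalpha():
        return None
    name = unicodedata.name(ch, "")
    if name.startswith("HEBREW"):
        return "hebrew"
    if name.startswith(("HIRAGANA", "KATAKANA")):
        return "kana"
    if name.startswith("CJK UNIFIED IDEOGRAPH"):
        return "han"
    return "other"


def pick_pipeline(text: str) -> str:
    """Pick a pipeline label from the scripts that appear in `text`."""
    counts = Counter(s for s in map(char_script, text) if s is not None)
    if not counts:
        return DEFAULT
    # Assumption: any kana means Japanese. Japanese written only in kanji
    # would be misrouted to the Chinese label by this simple heuristic.
    if counts["kana"] > 0:
        return JAPANESE
    dominant = counts.most_common(1)[0][0]
    return {"han": CHINESE, "hebrew": HEBREW}.get(dominant, DEFAULT)


def naive_tokens(text: str) -> List[str]:
    """Whitespace-only splitting, a stand-in for a space-delimited approach."""
    return text.split()


if __name__ == "__main__":
    for sample in ["hello world", "שלום עולם", "我喜欢搜索引擎", "検索エンジンが好きです"]:
        print(pick_pipeline(sample), naive_tokens(sample))
```

In this sketch, a sentence written without spaces comes back from `naive_tokens` as a single token, which is one way to see why results for such languages can be less relevant when no tailored pipeline applies.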