Increase limit: number of positions (~ words) per attribute #1770

Closed
3 tasks done
curquiza opened this issue Oct 5, 2021 · 19 comments · Fixed by #1847
Labels: impacts docs, milli, tracking issue

Comments

@curquiza
Member

curquiza commented Oct 5, 2021

Related to this tiny spec: meilisearch/specifications#80

The limit on the number of positions per attribute is currently 1000.

See the docs page

This limit will be increased to 65 535.

@meilisearch/docs-team: the whole explanation remains unchanged; only the number 1000 should be replaced by 65 535.


TODO:

@curquiza curquiza added the impacts docs This issue involves changes in the Meilisearch's documentation label Oct 5, 2021
@curquiza curquiza added this to the v0.24.0 milestone Oct 5, 2021
@curquiza curquiza changed the title Increase limit: number of postitions (~ words) per attribute Increase limit: number of positions (~ words) per attribute Oct 5, 2021
@curquiza curquiza added the tracking issue Tracks development of a global issue label Oct 5, 2021
@curquiza
Member Author

curquiza commented Oct 5, 2021

⚠️ @meilisearch/docs-team
A warning must be added to the changelogs (release and article) saying that the size of the DB (data.ms) can increase between v0.23.0 and v0.24.0 due to this change!
This change is considered impactful for users since it can affect disk usage.

@curquiza curquiza added breaking change The related changes are breaking for the users and removed breaking change The related changes are breaking for the users labels Oct 7, 2021
@curquiza
Member Author

Milli 0.18.0 is out containing this change 🎉
https://github.com/meilisearch/milli/releases/tag/v0.18.0

@curquiza curquiza added the milli Related to the milli workspace label Oct 18, 2021
@remram44

I was a bit surprised to find this in my evaluation of MeiliSearch. 1000 words is a three-page document, that is a very low limit and I was wondering what kind of use-case MeiliSearch was targeting (conversations?).

65535 seems much more reasonable so I am looking forward to this release!

@curquiza
Member Author

Hello @remram44!
Thanks for asking this question!
One of the core team developers (@ManyTheFish) already answered it, but in our Slack, so not publicly.
I will copy/paste the Slack thread here:


Question

What sort of effort, code wise, would it be to remove the 1,000 word field limit? It’s the primary reason I stay on Typesense. I know there are workarounds such as splitting into multiple fields, but I’d just like to understand a bit more, behind the decision to limit it and what sort of architectural changes would be needed to remove it. Thanks.

Answer

I have several answers to this, depending on the view point.

Relevancy

Meilisearch is a search engine: the goal is to return the most relevant documents for a given search request, so we want to keep the most relevant words in each document. The premise is: "the deeper a word is in an attribute, the less relevant it is."

The current version considers any word positioned after position 1000 not relevant enough to be taken into account in the search. Because more words mean more noise, raising this limit could lead to a loss of relevancy.

Performances & Memory

Meilisearch has to respond as fast as possible, so we pre-compute a lot of things while indexing documents. Raising this limit will lead to higher disk usage and a longer indexing time. Moreover, because there is more data, search time could be impacted.

Technical limit

Does Meilisearch have a technical limit?
Yes, but it is not 1000: the real technical limit should be 65535 (the maximum value of a 16-bit unsigned integer). So we can technically raise this limit to 65535 positions per attribute. \o/
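
To make the two limits concrete, here is a minimal sketch of what a position cap does to a long attribute. It assumes that one whitespace-separated word takes one position, which is only an approximation of the real tokenizer; `indexed_words` is a hypothetical helper, not part of Meilisearch or milli.

```python
# Largest value a 16-bit unsigned integer can hold.
U16_MAX = 2**16 - 1  # 65535


def indexed_words(text, max_positions=U16_MAX):
    """Return the words that fall within the position limit.

    Assumption: one whitespace-separated word occupies one position.
    Words past the limit are silently dropped, as described in this thread.
    """
    return text.split()[:max_positions]


# A synthetic 70,000-word attribute.
long_text = " ".join(f"w{i}" for i in range(70_000))

kept_old = indexed_words(long_text, 1000)  # old limit
kept_new = indexed_words(long_text)        # new limit
print(len(kept_old), len(kept_new))
```

Under this approximation, everything after the 1000th word was previously unsearchable, while the new cap only affects attributes longer than 65535 words.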

The arbitrary limit of 1000

Why do we have this limit of 1000 positions per attribute?
To be honest, we don't have any proof that 1000 is the optimal limit.
That's why we rethought it and are raising this limit.

@remram44

the deeper a word is in an attribute, the less relevant it is

Wow, that is an extremely bad fit for document search. Can I ask where this assumption comes from?

MeiliSearch seems optimized for a use case that I do not understand (if it even exists). What kind of content has progressively decreasing relevance?

@curquiza
Member Author

Wow, that is an extremely bad fit for document search. Can I ask where this assumption comes from?

If you think this is not the relevancy you expect, you can remove attribute from the ranking rules.

This is something that some users need. For example with the following dataset:

[
  { "id": 1, "title": "Harry Potter and the Half-Blood Prince", "description": "A story about a wizard" },
  { "id": 2, "title": "Fantastic Beasts and Where to Find Them", "description": "A movie in the universe of Harry Potter" }
]

If you type harry potter, we consider the first document is more relevant than the second document.
But again, it depends on the relevancy you need, and you can customize it by redefining your own ranking rules.
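
As a rough illustration of redefining the ranking rules, the sketch below builds a rule list without `attribute`. The default order shown is an assumption based on the v0.24 documentation, and `without_rule` is a hypothetical helper; check your own index settings before sending anything to the ranking rules settings route.

```python
import json

# Assumed default rule order for v0.24; verify against your index settings.
DEFAULT_RANKING_RULES = ["words", "typo", "proximity", "attribute", "sort", "exactness"]


def without_rule(rules, name):
    """Return a copy of the rule list with one rule removed, order preserved."""
    return [rule for rule in rules if rule != name]


custom_rules = without_rule(DEFAULT_RANKING_RULES, "attribute")

# This JSON array is the kind of payload the ranking rules setting expects.
print(json.dumps(custom_rules))
```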

@remram44

This seems unrelated, it's about favoring some attributes over other attributes according to an order. Not about favoring some words over other words in a single attribute.

@curquiza
Member Author

The depth considered by MeiliSearch is in the same attribute but also between the attributes.

With

[
  { "id": 1, "description": "Harry Potter and his friends live a lot of adventures." },
  { "id": 2, "description": "A movie in the universe of Harry Potter" }
]

Doc 1 is considered more relevant than doc 2. My example is really trivial, but this can be useful when you have attributes with a lot of words. Again, it depends on your own use case, so if it's something you don't want, you can remove attribute from the ranking rules.
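
The ordering described here can be approximated with a toy sort key: attribute rank first, then word position inside the attribute. This is only a sketch of the idea, not milli's actual algorithm; `attribute_key` and the `searchable` list are hypothetical names, and words are split on whitespace.

```python
def attribute_key(doc, query_words, searchable):
    """Smallest (attribute rank, word position) among all query-word matches.

    Lower is better. Documents with no match get a sentinel that sorts last.
    """
    best = (len(searchable), 0)
    for rank, attr in enumerate(searchable):
        words = doc.get(attr, "").lower().split()
        for position, word in enumerate(words):
            if word in query_words:
                best = min(best, (rank, position))
    return best


docs = [
    {"id": 2, "description": "A movie in the universe of Harry Potter"},
    {"id": 1, "description": "Harry Potter and his friends live a lot of adventures."},
]
query = {"harry", "potter"}

ranked = sorted(docs, key=lambda d: attribute_key(d, query, ["description"]))
print([d["id"] for d in ranked])
```

Doc 1 matches at position 0 of `description` and doc 2 only at position 6, so doc 1 sorts first in this model.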

@remram44

You mean removing "words" from the ranking rules?

@curquiza
Member Author

curquiza commented Oct 21, 2021

No, attribute. I just realized the documentation is not up to date; I opened a PR to patch it: meilisearch/documentation#1222

Sorry for that!

@remram44

Why do you have a "words" ranking rule if the ranking by words is controlled by the "attributes" rule?

The more I dig the more MeiliSearch is inscrutable. I would have liked a portable, memory-safe solution but I am staying with TypeSense, nothing in here makes sense to me.

@ManyTheFish
Member

ManyTheFish commented Oct 21, 2021

Hello @remram44,

The "words" criterion targets the words of the user query and will remove/ignore the last query words one by one to fill the response. For instance, take a query "t-shirt covfefe" requesting 20 documents. Because our imaginary store has only 1 t-shirt with a covfefe print and we want to return 20 documents, the criterion will remove the word covfefe from the query and rerun a search returning other documents matching the word t-shirt.

The "attribute" criterion will rank documents depending on the position of the matching word in the document:

  • if a match is in a higher attribute, the document is boosted
  • if a match is at the beginning of an attribute, the document is boosted

We could rename this criterion wordPosition, but that would be confusing because of the words criterion. Moreover, the attribute position carries more weight than the word position within the attribute.
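
The behavior of the "words" criterion described above can be sketched as a loop that drops the last query word until enough results are collected. This is a simplified model under stated assumptions (exact whitespace-separated word matching on a single `title` field); `search_with_words_rule` is a hypothetical function, not the engine's implementation.

```python
def search_with_words_rule(docs, query, limit):
    """Collect up to `limit` documents, dropping trailing query words one by one.

    Documents matching more query words are collected first.
    """
    terms = query.lower().split()
    results = []
    while terms and len(results) < limit:
        for doc in docs:
            words = doc["title"].lower().split()
            if doc not in results and all(term in words for term in terms):
                results.append(doc)
        terms = terms[:-1]  # ignore the last query word and search again
    return results[:limit]


store = [
    {"id": 1, "title": "t-shirt covfefe"},
    {"id": 2, "title": "t-shirt blue"},
    {"id": 3, "title": "mug covfefe"},
]

hits = search_with_words_rule(store, "t-shirt covfefe", limit=20)
print([doc["id"] for doc in hits])
```

With this toy store, the only full match comes first, followed by the other t-shirt found after covfefe is dropped from the query.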

I hope my explanations were clear, and I encourage you to try meilisearch and see if it can fit your needs despite our weaknesses.

Anyway, thanks a lot for your feedback!
Don't hesitate to ask more questions. 👍

@curquiza curquiza linked a pull request Oct 26, 2021 that will close this issue
@bors bors bot closed this as completed in #1847 Oct 26, 2021
@Sembiance

Very excited for this enhancement! It allows me to come back to MeiliSearch (I've been using typesense since I encountered the 1,000 word limit)

One suggestion:
I think it's important for MeiliSearch to notify the user whenever it drops data. Silently dropping data is bad :)

I spent half a day debugging my search queries to try and figure out why I couldn't find a document, turned out it was because MeiliSearch silently dropped all the data beyond the 1,000th word.

When getting the update status, it would be great if the response contained something to show that data was ignored.
Perhaps:

{
  "warnings": [
    { "id": "<document_id>", "truncatedAttributes": [ "<attribute_id>" ] }
  ]
}
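
A client-side approximation of the suggested warning could look like the sketch below. It assumes one whitespace-separated word per position and only inspects top-level string attributes; `truncation_warnings` is a hypothetical helper, and the output shape simply mirrors the proposal above rather than any existing API response.

```python
def truncation_warnings(documents, max_positions=65535, id_field="id"):
    """List, per document, the string attributes that exceed the position limit."""
    warnings = []
    for doc in documents:
        truncated = [
            attr
            for attr, value in doc.items()
            if isinstance(value, str) and len(value.split()) > max_positions
        ]
        if truncated:
            warnings.append({"id": doc[id_field], "truncatedAttributes": truncated})
    return {"warnings": warnings}


sample = [
    {"id": 1, "title": "short", "body": "one two three four"},
    {"id": 2, "body": "ok"},
]

# A tiny limit makes the behavior visible on small data.
report = truncation_warnings(sample, max_positions=3)
print(report)
```

Running such a check before sending documents would at least surface which attributes are at risk of being cut off.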

@curquiza
Member Author

Hello @Sembiance!
Welcome back then 😄
Have you tried v0.24.0rc2? Does it work for you?

I opened a ticket in the product repo so that the product team could take your suggestion of warning into account: meilisearch/product#305

@Sembiance

Have you tried v0.24.0rc2? Does it work for you?
Not yet. Currently working on another project, might be a month or so before I can circle back to MeiliSearch.

I opened a ticket in the product repo so that the product team could take your suggestion of warning into account: meilisearch/product#305

Thanks :)

@mikerogerz

mikerogerz commented Nov 29, 2021

@ManyTheFish Thanks a lot for the explanation. I'm not sure if this is useful to anyone else, but this brings up another issue due to "attribute" technically handling two different cases which should be separated into two different rules.

To me it seems like there should be an "attribute" ranking, which is based on the match being in the higher attribute to boost the result, and then another possible "matchPosition" or "wordPosition" within the attribute, which boosts depending on position closer to the beginning of an attribute.

This has become an issue because I can't separate the two rules: I'm indexing long documents as multiple docs, where the "matchPosition" doesn't matter to me. I would want to leave it out as a rule (ignoring it) but still sort by attribute importance (such as the title being more important than the description of an article).

Maybe this has already been discussed, but I was unable to find any mention of it in the product discussions. It seems like a useful addition/improvement for those of us indexing long documents into multiple MeiliSearch documents referring to the same piece of content. When indexing these long documents, the position of the match within the attribute is not useful at all and should be ignored.
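
For readers unfamiliar with this workaround, splitting one long piece of content into several documents can be sketched as follows. The field names (`parent_id`, `title`, `body`), the id scheme, and the 1000-word chunk size are all assumptions made for illustration; `split_into_chunks` is a hypothetical helper.

```python
def split_into_chunks(doc_id, title, body, max_words=1000):
    """Split one long text into several documents that share a parent id."""
    words = body.split()
    return [
        {
            "id": f"{doc_id}-{n}",
            "parent_id": doc_id,  # lets the application regroup chunks at search time
            "title": title,       # repeated so the title still outranks the body
            "body": " ".join(words[start:start + max_words]),
        }
        for n, start in enumerate(range(0, len(words), max_words))
    ]


article = " ".join(["word"] * 2500)
chunks = split_into_chunks("a1", "A long article", article)
print([(c["id"], len(c["body"].split())) for c in chunks])
```

In this layout a match at word 1 of the second chunk and a match at word 999 of the first chunk come from neighboring parts of the same text, which is why the within-attribute position carries little meaning here.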

@ManyTheFish
Member

Hey @mikerogerz! Thanks for your complete response!
I think your point of view is really interesting, and it may be discussed (poke @gmourier).

I think that nothing will be changed before 2022, but we will do at least a "public response" of "why did we choose or not to split the attribute ranking rule in 2?".

Thanks a lot for your feedback! 👍

@curquiza
Member Author

curquiza commented Dec 7, 2021

Hello @mikerogerz, I just opened a ticket in the product repository to make sure we take your feedback into consideration -> meilisearch/product#329

@mikerogerz

Hey @curquiza I really appreciate it. I'll follow that discussion to see how it progresses.


5 participants