Increase limit: number of positions (~ words) per attribute #1770

curquiza · 2021-10-05T15:27:50Z

Related to this tiny spec: meilisearch/specifications#80

The number of positions per attribute is currently limited to 1000.

See the docs page

This limit will be increased to 65 535.

@meilisearch/docs-team: the explanation as a whole remains unchanged; only the number 1000 should be replaced by 65 535.

TODO:

Merge this PR: Remove limit of 1000 position per attribute milli#368
Release milli
Update the milli dependency in this repo

curquiza · 2021-10-05T15:35:06Z

⚠️ @meilisearch/docs-team
A warning must be added to the changelogs (release and article) saying that the size of the DB (data.ms) can increase between v0.23.0 and v0.24.0 due to this change!
This change is considered impactful for users since it can affect disk usage.

curquiza · 2021-10-18T21:47:53Z

Milli 0.18.0 is out containing this change 🎉
https://github.com/meilisearch/milli/releases/tag/v0.18.0

remram44 · 2021-10-19T15:07:36Z

I was a bit surprised to find this in my evaluation of MeiliSearch. 1000 words is a three-page document; that is a very low limit, and I was wondering what kind of use case MeiliSearch was targeting (conversations?).

65535 seems much more reasonable so I am looking forward to this release!

curquiza · 2021-10-20T10:56:15Z

Hello @remram44!
Thanks for asking this question!
One of the core team developers (@ManyTheFish) already answered it, but in our Slack, so not publicly.
I will copy/paste the Slack thread here:

Question

What sort of effort, code-wise, would it take to remove the 1,000 word field limit? It's the primary reason I stay on Typesense. I know there are workarounds such as splitting into multiple fields, but I'd just like to understand a bit more about the decision to limit it and what sort of architectural changes would be needed to remove it. Thanks.

Answer

I have several answers to this, depending on the viewpoint.

Relevancy

Meilisearch is a search engine: the goal is to return the most relevant documents for a given search request, so we want to keep the most relevant words in each document. The premise is: "the deeper a word is in an attribute, the less relevant it is".

The current version considers that any word positioned after position 1000 is not relevant enough to be taken into account in the search. Because more words mean more noise, raising this limit could lead to a loss of relevancy.

Performance & Memory

Meilisearch has to respond as fast as possible, so we pre-compute a lot of things while indexing documents. Raising this limit will lead to higher disk usage and longer indexing times. Moreover, because there is more data, search time could be impacted.

Technical limit

Does Meilisearch have a technical limit?
Yes, but it is not 1000: the real technical limit should be 65535 (the maximum value of a 16-bit unsigned integer). So we can technically raise this limit to 65535 positions per attribute. \o/
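
As a rough illustration of where 65535 comes from: a position stored in a 16-bit unsigned integer can take 2^16 = 65536 distinct values, 0 through 65535. The sketch below assumes a naive whitespace tokenizer and a hypothetical `index_positions` helper; it is not milli's actual indexing code.

```python
import struct

MAX_POSITIONS = 2**16 - 1  # 65535, the largest value of a 16-bit unsigned integer


def index_positions(text, limit=MAX_POSITIONS):
    """Return (position, word) pairs, dropping every word past `limit` positions.

    Hypothetical helper: whitespace splitting stands in for a real tokenizer.
    """
    return [(pos, word) for pos, word in enumerate(text.split()) if pos < limit]


# The largest value still fits in two bytes ("<H" is a little-endian u16).
assert len(struct.pack("<H", MAX_POSITIONS)) == 2

# A 70,000-word attribute keeps only its first 65,535 words.
kept = index_positions("word " * 70_000)
print(len(kept))
```

With `limit=1000` the same helper mimics the old behaviour: everything after the 1000th word is simply not indexed.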

The arbitrary limit of 1000

Why do we have this limit of 1000 positions per attribute?
To be honest, we don't have any proof that 1000 is the optimal limit.
That's why we reconsidered it and are raising the limit.

remram44 · 2021-10-20T12:48:41Z

the deeper a word is in an attribute, the less relevant it is

Wow, that is an extremely bad fit for document search. Can I ask where this assumption comes from?

MeiliSearch seems optimized for a use case that I do not understand (if it even exists). What kind of content has progressively decreasing relevance?

curquiza · 2021-10-21T15:14:25Z

Wow, that is an extremely bad fit for document search. Can I ask where this assumption comes from?

If you think this is not the relevancy you expect, you can remove attribute from the ranking rules.

This is something that some users need. For example with the following dataset:

[
  { "id": 1, "title": "Harry Potter and the Half-Blood Prince", "description": "A story about a wizard" },
  { "id": 2, "title": "Fantastic Beasts and Where to Find Them", "description": "A movie in the universe of Harry Potter" }
]

If you type harry potter, we consider the first document more relevant than the second.
But again, it depends on the relevancy you need, and you can customize it by redefining your own ranking rules.
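
A minimal sketch of what "redefining your own ranking rules" could look like on the client side. The default list below is an assumption based on the rules documented around v0.24, and `without_rule` is a hypothetical helper; the snippet only builds the JSON body you would send to the index's ranking-rules setting, it does not call the API.

```python
import json

# Assumed default ranking rules (check the documentation for your version).
DEFAULT_RANKING_RULES = ["words", "typo", "proximity", "attribute", "sort", "exactness"]


def without_rule(rules, rule):
    """Return a copy of `rules` with `rule` removed, preserving order."""
    return [r for r in rules if r != rule]


# JSON body for the ranking-rules setting, with "attribute" dropped.
payload = json.dumps(without_rule(DEFAULT_RANKING_RULES, "attribute"))
print(payload)
```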

remram44 · 2021-10-21T15:21:02Z

This seems unrelated, it's about favoring some attributes over other attributes according to an order. Not about favoring some words over other words in a single attribute.

curquiza · 2021-10-21T15:34:30Z

The depth considered by MeiliSearch applies within a single attribute as well as across attributes.

With

[
  { "id": 1, "description": "Harry Potter and his friends live a lot of adventures." },
  { "id": 2, "description": "A movie in the universe of Harry Potter" }
]

Doc 1 is considered more relevant than doc 2. My example is really trivial, but this can be useful when you have attributes with a lot of words. Again, it depends on your own use case, so if it's something you don't want, you can remove attribute from the ranking rules.

remram44 · 2021-10-21T15:47:00Z

You mean removing "words" from the ranking rules?

curquiza · 2021-10-21T16:00:21Z

No, attribute. I just realized the documentation is not up to date; I've opened a PR to patch it: meilisearch/documentation#1222

Sorry for that!

remram44 · 2021-10-21T16:21:55Z

Why do you have a "words" ranking rule if the ranking by words is controlled by the "attributes" rule?

The more I dig the more MeiliSearch is inscrutable. I would have liked a portable, memory-safe solution but I am staying with TypeSense, nothing in here makes sense to me.

ManyTheFish · 2021-10-21T17:07:20Z

Hello @remram44,

The "words" criterion targets the words of the user query and will remove/ignore the last query words one by one to fill the response. For instance, take a query "t-shirt covfefe" requesting 20 documents. Because our imaginary store only has 1 t-shirt with a covfefe print and we want to return 20 documents, the criterion will remove the word covfefe from the query and rerun the search, returning other documents matching the word t-shirt.
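
The behaviour described above can be sketched as follows. This is a toy model under simplifying assumptions (documents are plain strings, matching is exact whole-word matching, and `search_with_words_rule` is a hypothetical name), not how the engine implements the criterion.

```python
def search_with_words_rule(query, documents, limit):
    """Collect up to `limit` documents, dropping trailing query words one by one."""
    terms = query.lower().split()
    hits = []  # indices into `documents`, best matches first
    while terms and len(hits) < limit:
        for i, doc in enumerate(documents):
            words = doc.lower().split()
            # A document matches only if it contains every remaining term.
            if i not in hits and all(term in words for term in terms):
                hits.append(i)
        terms.pop()  # relax the query: forget its last word
    return [documents[i] for i in hits[:limit]]


store = ["covfefe t-shirt", "blue t-shirt", "red t-shirt", "green socks"]
print(search_with_words_rule("t-shirt covfefe", store, limit=3))
```

Documents matching the full query come first; documents matching only "t-shirt" fill the remaining slots.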

The "attribute" criterion will rank documents depending on the position of the matching word in the document:

if a match is in a higher attribute, the document is boosted
if a match is at the beginning of an attribute, the document is boosted

We could rename this criterion wordPosition, but it would be confusing because of the words criterion. Moreover, the attribute position carries more weight than the word position within the attribute.
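
One way to picture the two boosts and their relative weight is a sort key made of (attribute index, word position): tuples compare on the first element first, so the attribute always dominates. The sketch below handles a single query term with whitespace tokenization; `attribute_rank` is a hypothetical helper, not the engine's scoring function.

```python
def attribute_rank(doc, term, attribute_order):
    """Return (attribute index, word position) of the best match; lower sorts first."""
    best = (len(attribute_order), 0)  # sentinel used when nothing matches
    for attr_index, attr in enumerate(attribute_order):
        words = doc.get(attr, "").lower().split()
        if term in words:
            best = min(best, (attr_index, words.index(term)))
    return best


docs = [
    {"id": 2, "title": "Fantastic Beasts and Where to Find Them",
     "description": "A movie in the universe of Harry Potter"},
    {"id": 1, "title": "Harry Potter and the Half-Blood Prince",
     "description": "A story about a wizard"},
]
order = ["title", "description"]
ranked = sorted(docs, key=lambda d: attribute_rank(d, "potter", order))
print([d["id"] for d in ranked])
```

Document 1 matches in the title (key `(0, 1)`), document 2 only in the description (key `(1, 7)`), so document 1 sorts first.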

I hope my explanations were clear, and I encourage you to try meilisearch and see if it can fit your needs despite our weaknesses.

Anyway, thanks a lot for your feedback!
Don't hesitate to ask more questions. 👍

Sembiance · 2021-11-10T18:41:31Z

Very excited for this enhancement! It allows me to come back to MeiliSearch (I've been using typesense since I encountered the 1,000 word limit)

One suggestion:
I think it's important for MeiliSearch to notify the user whenever it drops data. Silently dropping data is bad :)

I spent half a day debugging my search queries to try and figure out why I couldn't find a document, turned out it was because MeiliSearch silently dropped all the data beyond the 1,000th word.

When getting the update status, it would be great if the response contained something to show that data was ignored.
Perhaps:

{
  "warnings": [
    { "id": "<document_id>", "truncatedAttributes": [ "<attribute_id>" ] }
  ]
}
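
Until something like this exists server-side, a client could run a similar check before sending documents. The sketch below produces the shape proposed above; it assumes whitespace-separated words approximate the engine's positions, and `truncation_warnings` is a hypothetical helper, not part of any Meilisearch SDK.

```python
def truncation_warnings(documents, limit=1000):
    """List string attributes whose word count exceeds `limit`, per document."""
    warnings = []
    for doc in documents:
        truncated = [
            attr for attr, value in doc.items()
            if isinstance(value, str) and len(value.split()) > limit
        ]
        if truncated:
            warnings.append({"id": doc["id"], "truncatedAttributes": truncated})
    return {"warnings": warnings}


batch = [
    {"id": 1, "title": "Short", "body": "word " * 1500},
    {"id": 2, "title": "Also short", "body": "just a few words"},
]
print(truncation_warnings(batch))
```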

curquiza · 2021-11-11T13:04:58Z

Hello @Sembiance!
Welcome back then 😄
Have you tried v0.24.0rc2? Does it work for you?

I opened a ticket in the product repo so that the product team could take your suggestion of warning into account: meilisearch/product#305

Sembiance · 2021-11-11T13:13:06Z

Have you tried v0.24.0rc2? Does it work for you?
Not yet. Currently working on another project, might be a month or so before I can circle back to MeiliSearch.

I opened a ticket in the product repo so that the product team could take your suggestion of warning into account: meilisearch/product#305

Thanks :)

mikerogerz · 2021-11-29T17:12:17Z

@ManyTheFish Thanks a lot for the explanation. I'm not sure if this is useful to anyone else, but this brings up another issue: "attribute" technically handles two different cases, which should be separated into two different rules.

To me it seems like there should be an "attribute" rule, which boosts a result when the match is in a higher attribute, and then a separate "matchPosition" or "wordPosition" rule, which boosts a result depending on how close the match is to the beginning of the attribute.

This has become an issue for me because I can't separate the two: I'm splitting long documents into multiple docs, so the "matchPosition" doesn't matter to me. I would want to leave it out as a rule (ignoring it), but still sort by importance of attribute (such as the title being more important than the description of an article).

Maybe this has already been discussed, but I was unable to find any mention of it in the product discussions. It seems like a useful addition/improvement for those of us indexing long documents as multiple MeiliSearch documents referring to the same piece of content. When indexing these long documents, the position of the match within the attribute is not useful at all and should be ignored.
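
For readers unfamiliar with the approach, splitting one long document into several indexable documents might look like the sketch below. The `split_document` helper, the `parent_id` field, and the `<id>-<n>` identifier scheme are all assumptions made for illustration; whitespace splitting stands in for real tokenization.

```python
def split_document(doc, attribute, max_words=1000):
    """Split `doc[attribute]` into chunks of at most `max_words` words.

    Each chunk copies the other fields (e.g. the title) and records the
    original id in `parent_id` so results can be grouped back together.
    """
    words = doc[attribute].split()
    chunks = []
    # max(..., 1) keeps one (empty) chunk when the attribute has no words.
    for n, start in enumerate(range(0, max(len(words), 1), max_words)):
        chunk = dict(doc)
        chunk["id"] = f"{doc['id']}-{n}"
        chunk["parent_id"] = doc["id"]
        chunk[attribute] = " ".join(words[start:start + max_words])
        chunks.append(chunk)
    return chunks


article = {"id": 42, "title": "A long article", "body": "word " * 2500}
parts = split_document(article, "body")
print([p["id"] for p in parts])
```

Because every chunk restarts at position 0, the position of a match inside a chunk says nothing about its position in the original text, which is why the in-attribute boost is unhelpful in this setup.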

ManyTheFish · 2021-12-06T09:38:26Z

Hey @mikerogerz! Thanks for your complete response!
I think your point of view is really interesting, and it may be discussed (poke @gmourier).

I think that nothing will be changed before 2022, but we will do at least a "public response" of "why did we choose or not to split the attribute ranking rule in 2?".

Thanks a lot for your feedback! 👍

curquiza · 2021-12-07T18:02:19Z

Hello @mikerogerz, I just opened a ticket in the product repository to make sure we take your feedback into consideration -> meilisearch/product#329

mikerogerz · 2021-12-07T23:13:36Z

Hey @curquiza I really appreciate it. I'll follow that discussion to see how it progresses.

curquiza added the impacts docs This issue involves changes in the Meilisearch's documentation label Oct 5, 2021

curquiza added this to the v0.24.0 milestone Oct 5, 2021

curquiza assigned ManyTheFish Oct 5, 2021

curquiza changed the title ~~Increase limit: number of postitions (~ words) per attribute~~ Increase limit: number of positions (~ words) per attribute Oct 5, 2021

curquiza added the tracking issue Tracks development of a global issue label Oct 5, 2021

curquiza added breaking change The related changes are breaking for the users and removed breaking change The related changes are breaking for the users labels Oct 7, 2021

guimachiavelli mentioned this issue Oct 12, 2021

v0.24: Increase number of positions per attribute meilisearch/documentation#1195

Closed

1 task

curquiza added the milli Related to the milli workspace label Oct 18, 2021

curquiza mentioned this issue Oct 26, 2021

Optimize document transform #1847

Merged

curquiza linked a pull request Oct 26, 2021 that will close this issue

Optimize document transform #1847

Merged

bors bot closed this as completed in #1847 Oct 26, 2021
