Creates a TokenMetadataStore to return startPosition of tokens in results #79

Open · wants to merge 4 commits into master
Conversation

cambridgemike

I took a stab at prototyping a solution for #25, and took your comments in #58 into consideration.

I think you're right that the long-term goal should be to refactor the library to replace existing tokens with a lunr.Token. Unfortunately that seems a bit overwhelming, as it presents a lot of complexity with previously serialized data stores and complicates the TokenStore. As a step in the right direction, I propose a lunr.TokenMetadataStore, which maps a doc ref and an indexed token to a lunr.Token. By "indexed token" I mean a present-day token (a string), and lunr.Token is an object I introduce to encapsulate metadata about a token (like its start position). The metadata store lives on the sidelines: it is only written to when a document is indexed, and only read from when a document is surfaced in search results.

So now you'll get back a result set that looks like this:

// idx.add({ id: 2, 
//   body: "Some are born great, some achieve greatness, and some have greatness thrust upon them." 
// })
// idx.search("greatness")

[{
    "ref": 2,
    "score": 0.12345,
    "tokens": [
      {indexedAs: "great", raw: "great", startPos: 15, field: "body"},
      {indexedAs: "great", raw: "greatness", startPos: 35, field: "body"},
      {indexedAs: "great", raw: "greatness", startPos: 60, field: "body"}
    ]
}]

Overall, I created three new object types:

  1. lunr.Token, a container for metadata like startPosition.
  2. lunr.TokenList, essentially an array of lunr.Tokens with some helper methods to extract the indexed tokens and the raw tokens.
  3. lunr.TokenMetadataStore, the data store described above.
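For concreteness, here is a minimal sketch of how the three types could fit together. The field names (`raw`, `indexedAs`, `startPos`, `field`) follow the example result set above; everything else, including the `ref + '::' + indexedAs` key scheme, is an illustrative assumption rather than the actual implementation:

```javascript
function Token(raw, indexedAs, startPos, field) {
  this.raw = raw             // token as it appeared in the original document
  this.indexedAs = indexedAs // token after running through the pipeline
  this.startPos = startPos   // character offset within the field
  this.field = field         // document field the token came from
}

function TokenList(tokens) {
  this.tokens = tokens || []
}

// Helper methods to extract just the indexed or just the raw strings.
TokenList.prototype.indexedTokens = function () {
  return this.tokens.map(function (t) { return t.indexedAs })
}
TokenList.prototype.rawTokens = function () {
  return this.tokens.map(function (t) { return t.raw })
}

// Side-car store mapping (doc ref, indexed token) pairs to Tokens.
function TokenMetadataStore() {
  this.store = {}
}
TokenMetadataStore.prototype.add = function (ref, token) {
  var key = ref + '::' + token.indexedAs
  ;(this.store[key] = this.store[key] || []).push(token)
}
TokenMetadataStore.prototype.get = function (ref, indexedToken) {
  return this.store[ref + '::' + indexedToken] || []
}
```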

The only changes I had to make to the existing codebase are outlined as follows:

pipeline.js#run

This is probably the most substantial change I made. Running the pipeline now returns a lunr.TokenList instead of an array of strings. It works with old tokenizers that return strings as well as new tokenizers that return lunr.Tokens; the pipeline still passes string tokens to the stack, so backwards compatibility is preserved for third-party pipeline functions and third-party tokenizers.
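Roughly, the backwards-compatible behaviour described above could look like the following sketch (my own simplified stand-in, not the actual patch): the pipeline accepts either plain strings or Token-like objects, feeds only strings to each pipeline function, and re-attaches the result to the token object afterwards.

```javascript
function runPipeline(stack, tokens) {
  var out = []
  tokens.forEach(function (token) {
    var isTokenObject = typeof token === 'object'
    var str = isTokenObject ? token.raw : token
    // Third-party pipeline functions continue to receive plain strings.
    for (var i = 0; i < stack.length; i++) {
      str = stack[i](str)
      if (str === void 0 || str === '') return // token dropped by the pipeline
    }
    if (isTokenObject) {
      token.indexedAs = str
      out.push(token)
    } else {
      out.push({ raw: token, indexedAs: str })
    }
  })
  return out
}
```

Old-style tokenizers returning `['The', 'Greatness']` and new-style tokenizers returning Token objects both pass through the same stack unchanged.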

index.js#add

Updated this method to store the lunr.Tokens returned by the pipeline in the lunr.tokenMetadataStore. I added a configuration variable on the lunr.Index.prototype called useTokenMetadata which controls this behavior.
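In outline (a hypothetical sketch, not the patch itself; `tokenizeWithPositions` and the plain-object store are illustrative stand-ins for the real tokenizer, pipeline, and lunr.TokenMetadataStore), the add path records each token against its document ref only when the flag is enabled:

```javascript
// Stand-in tokenizer: records the character offset of each word.
function tokenizeWithPositions(text, field) {
  var tokens = [], re = /\S+/g, m
  while ((m = re.exec(text)) !== null) {
    var word = m[0].toLowerCase().replace(/[^\w]/g, '')
    if (word) tokens.push({ raw: word, indexedAs: word, startPos: m.index, field: field })
  }
  return tokens
}

// Record every token of every field in the side-car store,
// but only when useTokenMetadata is enabled.
function addDocument(store, doc, fields, useTokenMetadata) {
  if (!useTokenMetadata) return
  fields.forEach(function (field) {
    tokenizeWithPositions(doc[field], field).forEach(function (token) {
      var key = doc.id + '::' + token.indexedAs
      ;(store[key] = store[key] || []).push(token)
    })
  })
}
```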

index.js#search

Once the list of matching documents is found, return the lunr.Tokens that are associated with each document and appeared in the query string.
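The search-side lookup is cheap, since the store is only consulted for documents that actually matched. A sketch under the same assumed `ref + '::' + indexedToken` key scheme as above (illustrative, not the actual code):

```javascript
// Attach the stored Token metadata for each query token to each result.
function attachTokens(results, queryTokens, store) {
  return results.map(function (result) {
    var tokens = []
    queryTokens.forEach(function (indexed) {
      tokens = tokens.concat(store[result.ref + '::' + indexed] || [])
    })
    result.tokens = tokens
    return result
  })
}
```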

tokenizer.js

This was a pretty big change, and I did a quick and dirty job. I tried to keep the runtime relatively sane, but didn't worry too much about code organization. This should definitely be cleaned up if/when a merge happens.

Notes:

  • Drawbacks: this will use a lot of memory, since we create an individual token object for every single string that gets indexed.
  • Nomenclature: I started using the term "indexed token" to describe a present-day "token", i.e. a string that has been run through the pipeline (and will eventually end up in the index). A "raw token" is a string holding the token as it appeared in the original document (with the exception of being lowercased, since we currently lowercase in the tokenizer).
  • Tests: I updated a few things in the existing tests, but otherwise they all passed. I wrote a few tests to smoke-test my additions. If you think this is headed in the right direction I can shore it up with more tests.

@olivernn
Owner

olivernn commented Apr 1, 2014

Hey, thanks for putting your time into this, I'll try and go through your code in more detail at some point but just wanted to say thanks for taking a stab at this!

I've actually been working on some changes that should make #25 and #58 possible. Actually they will be by-products of what I'm trying to achieve which is to take token position into account when scoring matching documents, e.g. a document that contains search tokens closer together should score higher.

A slightly related feature is better wildcard support: currently a wildcard is automatically appended to every search term, which mostly works but has caused some issues (#74, #62, #33). What I want is for you to be able to enter a wildcard where you want/need it: at the beginning, end or in the middle of a search term.
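For illustration only (this is my own sketch, not the implementation on the branch), honouring a `*` wherever the user places it could be as simple as compiling the term into a regular expression:

```javascript
// Turn a search term with user-placed '*' wildcards into a RegExp.
function wildcardToRegExp(term) {
  // Escape regex metacharacters except '*', then let '*' match any run.
  var escaped = term.replace(/[.+?^${}()|[\]\\]/g, '\\$&')
  return new RegExp('^' + escaped.replace(/\*/g, '.*') + '$')
}
```

So `gr*ness` would match `greatness`, while the plain term `great` would not; today's implicit trailing wildcard becomes an explicit `great*`.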

Both of these require a lunr.Token object, and as you have found this is not such a trivial change 😉 I've actually got an implementation already; it's still mostly focused on the wildcard stuff, but I have a feeling it will be an enabler for all sorts of niceness and features like this.

I'm not overly concerned with backwards compatibility, yet. lunr isn't quite 1.0 so I feel I can still experiment a little with the public interfaces. Serialised indexes can be re-built and lunr currently warns you if you are using an index serialised with a different version.

My current work is still very much 'work in progress', but I'll try and tidy it up a little and push the branch here so you, and others, can take a look at where I'm going. I'd really appreciate your input on what I've got and on how to make sure it's compatible with what you're trying to achieve.

Thanks again for your help, I'll be sure to keep you updated.

@cambridgemike
Author

Thanks for the thoughtful response, I'd love to see what you've been working on. Having token position involved in the search process is also important for exact-phrase searches (e.g. "dog food", where the words "dog" and "food" appear next to each other in a document).
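With the startPos metadata from this PR, an exact-phrase check could be sketched like this (a hypothetical helper of my own, assuming tokens carry `raw` and `startPos` and words are separated by a single space):

```javascript
// Two tokens form a phrase when the second starts one space
// after the first ends in the original document text.
function isExactPhrase(first, second) {
  return second.startPos === first.startPos + first.raw.length + 1
}
```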

Cheers,
Mike

@hugovincent
Copy link

Any update on this? I'd love to be able to use this feature.

@olivernn
Copy link
Owner

Thanks for your interest @hugovincent. I've been working on a couple of changes to the way lunr works to support this feature, as well as better wildcard searching and a pipeline for scoring documents. You can follow along with what's happening on the next branch.

Everything is still very alpha and might change at any point, but it should at least give you an idea as to where this feature is going.

@aschuck

aschuck commented Oct 2, 2015

@olivernn Did anything become of the next branch, or of @cambridgemike's patch?

@olivernn
Owner

olivernn commented Oct 5, 2015

@aschuck sadly no; it's all still available on GitHub, but I haven't had a chance to take these any further.

@clns

clns commented Mar 2, 2016

I guess there are no updates with this, sadly. 😞
