Creates a TokenMetadataStore to return startPosition of tokens in results #79
Hey, thanks for putting your time into this. I'll try to go through your code in more detail at some point, but I just wanted to say thanks for taking a stab at this! I've actually been working on some changes that should make #25 and #58 possible. In fact, they will be by-products of what I'm trying to achieve, which is to take token position into account when scoring matching documents, e.g. a document that contains search tokens closer together should score higher. A slightly related feature is better wildcard support. Currently a wildcard is automatically appended to every search term; this mostly works but has caused some issues (#74, #62, #33). What I want is for you to be able to enter a wildcard where you want or need it: at the beginning, at the end, or in the middle of a search term. Both of these require a I'm not overly concerned with backwards compatibility, yet. lunr isn't quite 1.0, so I feel I can still experiment a little with the public interfaces. Serialised indexes can be rebuilt, and lunr currently warns you if you are using an index serialised with a different version. My current work is still very much 'work in progress', but I'll try to tidy it up a little and push the branch here so you, and others, can take a look at where I'm going. I'd really appreciate your input on what I've got and how to make sure it is compatible with what you're trying to achieve. Thanks again for your help, I'll be sure to keep you updated.
Thanks for the thoughtful response, I'd love to see what you've been working on. Having token position involved in the search process is also important for exact searches (e.g., "dog food", where the words "dog" and "food" appear next to each other in a document). Cheers,
Any update on this? I'd love to be able to use this feature.
Thanks for your interest @hugovincent. I've been working on a couple of changes to the way lunr works to support this feature, as well as better wildcard searching and a pipeline for scoring documents. You can follow along with what's happening on the next branch. Everything is still very alpha and might change at any point, but it should at least give you an idea of where this feature is going.
@olivernn Did anything become of the next branch, or of @cambridgemike's patch?
@aschuck sadly no, it's all still available on GitHub, but I haven't had a chance to take these any further.
I guess there are no updates with this, sadly. 😞
I took a stab at prototyping a solution for #25, and took your comments in #58 into consideration.
I think you're right that the long-term goal should be to refactor to replace existing tokens with a `lunr.Token`. Unfortunately this seems a bit overwhelming, as it presents a lot of complexities with previously serialized dataStores and complicates the TokenStore. As a step in the right direction, I propose a `lunr.TokenMetadataStore`, which maps doc refs and indexed tokens to a `lunr.Token`. In this case, by "indexed token" I mean the present-day token (a string), and by `lunr.Token` I mean an object I introduce to encapsulate metadata about tokens (like startPosition). This MetadataStore lives on the sidelines: it is only added to when a document is indexed, and data is only retrieved when a document is surfaced in search results.
So now you'll get back a result set that looks like
Overall, I created three new object types:
- `lunr.Token`, which is a container for metadata like startPosition.
- `lunr.TokenList`, which is essentially an array of `lunr.Token`s, but has some helper methods to extract the indexed tokens and the raw tokens.
- `lunr.TokenMetadataStore`, which is a dataStore as described above.
The only changes I had to make to the existing codebase are outlined as follows:
pipeline.js#run
This is probably the most substantial change I made. Running the pipeline now returns a `lunr.TokenList` instead of an array of strings. This works with old tokenizers that return strings or new tokenizers that return `lunr.Token`s. The pipeline passes string tokens to the stack, so backwards compatibility is preserved for 3rd party pipeline functions and 3rd party tokenizers.
index.js#add
Updated this method to store the `lunr.Token`s returned by the pipeline in the `lunr.tokenMetadataStore`. I added a configuration variable on the `lunr.Index.prototype` called `useTokenMetadata` which controls this behavior.
index.js#search
Once a list of documents is found, return a list of `lunr.Token`s that are associated with the document and were in the query string.
tokenizer.js
This was a pretty big change, and I did a quick and dirty job. I tried to keep the runtime relatively sane, but didn't worry too much about code organization. This should definitely be cleaned up if/when a merge happens.
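To make the shape of the three new types easier to picture, here is a minimal standalone sketch. It is not the code from this patch: the constructor arguments, the `raw` and `indexedToken` property names, the `add`/`get` methods, and the whitespace `tokenize` helper are all assumptions chosen for illustration, and nothing here depends on lunr itself.

```javascript
// Hypothetical sketch only; names and signatures are illustrative.

// Container for token metadata: the original text, the string that
// actually gets indexed, and the offset where the token starts.
function Token(raw, indexedToken, startPosition) {
  this.raw = raw
  this.indexedToken = indexedToken
  this.startPosition = startPosition
}

// Thin wrapper around an array of Tokens with helpers to extract strings.
function TokenList(tokens) {
  this.tokens = tokens || []
}
TokenList.prototype.indexedTokens = function () {
  return this.tokens.map(function (t) { return t.indexedToken })
}
TokenList.prototype.rawTokens = function () {
  return this.tokens.map(function (t) { return t.raw })
}

// Side store mapping docRef -> indexed token -> array of Tokens.
function TokenMetadataStore() {
  this.store = Object.create(null)
}
TokenMetadataStore.prototype.add = function (docRef, token) {
  var doc = this.store[docRef] || (this.store[docRef] = Object.create(null))
  var list = doc[token.indexedToken] || (doc[token.indexedToken] = [])
  list.push(token)
}
TokenMetadataStore.prototype.get = function (docRef, indexedToken) {
  var doc = this.store[docRef]
  return (doc && doc[indexedToken]) || []
}

// Simplistic whitespace tokenizer that records start positions
// (a stand-in for the real tokenizer, which is more involved).
function tokenize(text) {
  var tokens = []
  var re = /\S+/g
  var match
  while ((match = re.exec(text)) !== null) {
    tokens.push(new Token(match[0], match[0].toLowerCase(), match.index))
  }
  return new TokenList(tokens)
}

// Index time: record metadata for every token in the document.
var tokenList = tokenize('Dog food for the dog')
var metadataStore = new TokenMetadataStore()
tokenList.tokens.forEach(function (t) { metadataStore.add('doc1', t) })

// Search time: look up metadata only for tokens that matched the query.
var dogPositions = metadataStore.get('doc1', 'dog').map(function (t) {
  return t.startPosition
})
```

Keeping the metadata in a separate store keyed by doc ref and indexed token means the existing string-based token store and pipeline functions do not need to know about it, which is the backwards-compatibility point made above.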
Notes: