Enable fuzzy text matching in Matcher #11359
any Cython advice on how to avoid |
Just a few initial notes:
Thanks @adrianeboyd. This plugin uses the notation you suggest https://spacy.io/universe/project/spaczz but in my evaluation it is much too slow to be practical, even with minimal use of the operator. Therefore I was trying for a more direct integration with the underlying rapidfuzz package. Agreed that a user-specified list of attributes would be better than a hard-coded list, but I'd like to be able to specify this on Matcher init rather than on individual tokens. My use case is a large existing set of patterns which I'd like to optionally apply with fuzzy matching, without having to modify every pattern (but perhaps this could be done in a pattern pre-processing stage). |
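The pattern pre-processing idea mentioned above could look roughly like the sketch below. It is only an illustration: the helper name `make_fuzzy`, the attribute set `FUZZY_ATTRS`, and the choice of `FUZZY2` as the default operator are all assumptions, not part of this PR.

```python
from copy import deepcopy

# Hypothetical set of attributes to treat as fuzzy in every pattern.
FUZZY_ATTRS = {"LOWER", "ORTH"}


def make_fuzzy(pattern, attrs=FUZZY_ATTRS, op="FUZZY2"):
    """Return a copy of a token pattern in which plain string values
    for the given attributes are wrapped in a fuzzy operator."""
    fuzzy = []
    for token in pattern:
        new_token = {}
        for key, value in token.items():
            if key in attrs and isinstance(value, str):
                # "string" becomes {"FUZZY2": "string"}
                new_token[key] = {op: value}
            else:
                # Leave operators, flags, and nested dicts untouched.
                new_token[key] = deepcopy(value)
        fuzzy.append(new_token)
    return fuzzy


pattern = [{"LOWER": "definitely"}, {"IS_PUNCT": True, "OP": "?"}]
fuzzy_pattern = make_fuzzy(pattern)
```

With something like this, an existing pattern set can be converted once at load time, so the original patterns stay unmodified.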
Have you tried implementing If you want to get a general idea of the speed difference just for the operator, you can try comparing |
added a also added a parameter to specify a list of attributes on init, to be treated as fuzzy in all patterns, e.g. TODO: handle extension attributes |
We think this is a good idea and we discussed a bit internally how we'd like to support it:
You can directly build against the C++ library https://github.com/maxbachmann/rapidfuzz-cpp. |
@adrianeboyd I just spent some time adding support for set operators. see https://github.com/explosion/spaCy/pull/11359/files#diff-d7740e9e8c9929c0b0e8e962c46a22e09263ef4891faceaa15415b2be94a5837R221-R263 but there probably isn't any way to use different fuzziness parameters for different set members. |
I like that idea! It is definitely concrete and easy for users to understand. |
@adrianeboyd a small, fast, MIT-licensed implementation of Levenshtein distance here https://github.com/roy-ht/editdistance |
From looking at examples for the docs, I think my proposed default of 20% (floor) is way too low. It allows 2 edits for any word of 1-14 characters, which isn't much of a useful range. |
It also seemed kind of odd to just have 1..5, so I added 6..9, which gives you ten buckets in total to use for some very custom |
My original function was trying to implement word length - 2 up to a maximum of 5. I don't think I've found any English word pairs where >5 edits is a useful match. |
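To make the two rules discussed above easier to compare, here is a minimal sketch. Both function names are hypothetical. The minimum of 2 edits in the first rule is inferred from the remark that words of 1-14 characters get 2 edits, and the clamp at 0 in the second rule is an assumption for very short words.

```python
def max_edits_20pct_floor(length: int) -> int:
    """20% of the word length, rounded down, with an assumed minimum of 2."""
    return max(2, length // 5)


def max_edits_length_minus_two(length: int) -> int:
    """Word length minus 2, capped at 5 (and assumed never negative)."""
    return max(0, min(5, length - 2))


# Print both limits side by side for a range of word lengths.
for length in (1, 3, 5, 7, 10, 14, 15, 20):
    print(length, max_edits_20pct_floor(length), max_edits_length_minus_two(length))
```

Under these assumptions the 20% rule stays at 2 edits until a word reaches 15 characters, while the length-based rule reaches its cap of 5 at 7 characters.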
To have naming similar to `phrase_matcher_attr`, rename the `fuzzy_compare` setting for `EntityRuler` and `SpanRuler` to `matcher_fuzzy_compare`. Organize next to `phrase_matcher_attr` in docs.
Having |
This is looking great - thanks @kwhumphreys and @adrianeboyd for your work on this!
I had a few comments & suggestions, but overall it'll be great to have this in spaCy 3.5 🙂
Co-authored-by: Sofie Van Landeghem <[email protected]>
@adrianeboyd note this usage, which I just came across today: https://www.elastic.co/guide/en/elasticsearch/reference/current/common-options.html#fuzziness |
Description
work in progress:
{"LOWER": {"FUZZY3": "string"}}
or {"LOWER": {"FUZZY2": {"IN": ["string1", "string2"]}}}
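As a rough illustration of what the `FUZZY2` and `FUZZY3` operators above are meant to express, the sketch below treats the number as a maximum Levenshtein edit distance. This is plain Python, not the matcher's actual implementation, and the helper names `levenshtein`, `fuzzy_match`, and `fuzzy_in` are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def fuzzy_match(text: str, pattern: str, max_edits: int) -> bool:
    """Roughly what {"FUZZY<n>": pattern} expresses for a single string."""
    return levenshtein(text, pattern) <= max_edits


def fuzzy_in(text: str, candidates: list, max_edits: int) -> bool:
    """Roughly what {"FUZZY<n>": {"IN": candidates}} expresses."""
    return any(fuzzy_match(text, candidate, max_edits) for candidate in candidates)


print(fuzzy_match("strng", "string", 3))
print(fuzzy_in("strinq1", ["string1", "string2"], 2))
```

Note that in the set form the same edit limit applies to every member, which matches the earlier remark that different fuzziness parameters per set member are probably not possible.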
Types of change
enhancement
Checklist