Enable fuzzy text matching in Matcher #11359

Merged
merged 87 commits into from
Jan 10, 2023


kwhumphreys
Contributor

@kwhumphreys kwhumphreys commented Aug 22, 2022

Description

work in progress:

  • use Levenshtein distance (polyleven implementation) to allow fuzzy matching of specific tokens in Matcher.
  • FUZZY1, FUZZY2, ... FUZZY9 operators specify max string edit distance to allow a match.
  • operators can apply to single tokens or sets, e.g. {"LOWER": {"FUZZY3": "string"}} or {"LOWER": {"FUZZY2": {"IN": ["string1", "string2"]}}}
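
To make the semantics of the numbered operators concrete, here is a minimal pure-Python sketch of "match if the edit distance is at most N". It is not the polyleven or spaCy implementation; the function names `levenshtein` and `fuzzy_token_match` are hypothetical and exist only for this illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute (or keep)
                )
            )
        previous = current
    return previous[-1]


def fuzzy_token_match(token_text: str, pattern_text: str, max_edits: int) -> bool:
    """Hypothetical stand-in for a FUZZY<N> check: accept up to max_edits edits."""
    return levenshtein(token_text, pattern_text) <= max_edits


# "strng" is one deletion away from "string"; "stng" is two deletions away.
print(fuzzy_token_match("strng", "string", 1))
print(fuzzy_token_match("stng", "string", 1))
```

Under this reading, a larger N only ever widens the set of accepted tokens, which is what makes the operator family easy to reason about.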

Types of change

enhancement

Checklist

  • I confirm that I have the right to submit this contribution under the project's MIT license.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

@shadeMe shadeMe added the labels enhancement (Feature requests and improvements), ⚠️ wip (Work in progress), and feat / matcher (Feature: Token, phrase and dependency matcher) Aug 22, 2022
@kwhumphreys
Contributor Author

@adrianeboyd
Contributor

adrianeboyd commented Aug 22, 2022

Just a few initial notes:

  • I think there are a lot of users who would like to have something like this, so this is a nice idea!

  • we're pretty conservative about adding new dependencies to spacy, so we'll have to seriously consider whether we want to add rapidfuzz

  • this feels like it ought to be an extra operator (similar to REGEX) rather than being a Matcher setting that's applied to a hard-coded subset of attributes

    pattern = [{"ORTH": {"FUZZY": "word"}}]

    Then you could apply it to any attributes that you want including custom extensions. What's trickier with the operator is a parameter for fuzziness.

@kwhumphreys
Contributor Author

Thanks @adrianeboyd. The spaczz plugin (https://spacy.io/universe/project/spaczz) uses the notation you suggest, but in my evaluation it is much too slow to be practical, even with minimal use of the operator. That is why I was trying for a more direct integration with the underlying rapidfuzz package.

Agreed that a user-specified list of attributes would be better than a hard-coded list, but I'd like to be able to specify this on Matcher init rather than on individual tokens. My use case is a large existing set of patterns which I'd like to optionally apply with fuzzy matching, without having to modify every pattern (but perhaps this could be done in a pattern pre-processing stage).
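
The pattern pre-processing idea could look roughly like the sketch below: existing token patterns are copied, and plain string values of the chosen attributes are wrapped in a fuzzy operator. The helper name `add_fuzzy` and its parameters are hypothetical, and the "FUZZY" key simply follows the notation suggested earlier in this thread.

```python
from copy import deepcopy


def add_fuzzy(patterns, attrs=("ORTH", "LOWER"), operator="FUZZY"):
    """Return a copy of the patterns with plain string values wrapped in `operator`.

    `patterns` is a list of token patterns, each a list of per-token dicts.
    Values that are already dicts (e.g. REGEX or IN) are left untouched.
    """
    rewritten = deepcopy(patterns)  # never mutate the caller's patterns
    for pattern in rewritten:
        for token_spec in pattern:
            for attr in attrs:
                value = token_spec.get(attr)
                if isinstance(value, str):
                    token_spec[attr] = {operator: value}
    return rewritten


original = [[{"LOWER": "hello"}, {"IS_PUNCT": True}, {"ORTH": {"REGEX": "^w"}}]]
print(add_fuzzy(original))
```

Because the rewrite happens before the patterns are added, the same pattern set can be loaded either strictly or fuzzily without editing each pattern by hand.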

@adrianeboyd
Contributor

adrianeboyd commented Aug 23, 2022

Have you tried implementing FUZZY within spacy's Matcher similar to REGEX? I think that spaczz has some of its own Python-only matcher implementation, so it's not just the extra operator that's causing the difference.

If you want to get a general idea of the speed difference just for the operator, you can try comparing {"ORTH": "the"} to {"ORTH": {"REGEX": "^the$"}}.
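
As a rough analogy only, the same kind of comparison can be made in plain Python without spaCy: exact string equality against an anchored regex over the same word list. This does not measure the Matcher itself, and the word list and function names are made up for the illustration.

```python
import re
import timeit

# Synthetic "tokens"; only the exact string "the" should count as a hit.
tokens = ["the", "then", "The", "other", "the"] * 1000
anchored = re.compile(r"^the$")


def exact_hits(words):
    """Indices of words equal to "the" (analogous to a plain attribute value)."""
    return [i for i, word in enumerate(words) if word == "the"]


def regex_hits(words):
    """Indices of words matching ^the$ (analogous to a per-token operator)."""
    return [i for i, word in enumerate(words) if anchored.search(word)]


exact_seconds = timeit.timeit(lambda: exact_hits(tokens), number=20)
regex_seconds = timeit.timeit(lambda: regex_hits(tokens), number=20)
print(f"exact: {exact_seconds:.4f}s  regex: {regex_seconds:.4f}s")
```

Both functions return the same hits, so any timing gap reflects the per-token cost of the comparison rather than a difference in results.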

setup.cfg (outdated, resolved)
@kwhumphreys
Contributor Author

added a FUZZY attribute, as suggested (and as compatible with spaczz), but using a fixed fuzzy threshold value specified on init.

also added a parameter to specify a list of attributes on init, to be treated as fuzzy in all patterns, e.g. Matcher(en_vocab, fuzzy=85, fuzzy_attrs=["ORTH", "LOWER"]).

TODO: handle extension attributes
TODO: handle values in sets

pyproject.toml (outdated, resolved)
@adrianeboyd
Contributor

We think this is a good idea and we discussed a bit internally how we'd like to support it:

  • we would prefer not to add rapidfuzz as a dependency, but would instead prefer to directly include one edit distance / similarity implementation in spacy (possibly from rapidfuzz, or reimplemented ourselves if that makes more sense)
  • we don't think that a top-level fuzzy setting for Matcher or EntityRuler makes sense because the appropriate fuzziness often depends so much on the exact strings / pattern / task and different patterns in the same Matcher would need different levels of fuzziness
  • we're still not quite sure how to implement a fuzziness parameter for the operator; my initial clunky idea would be to support a small range of hard-coded operators like FUZZY75, FUZZY80, etc. but we're still thinking about the options/details here (ideas welcome!)
  • it's possible there would be no way to support this in combination with set operators

@maxbachmann

> we would prefer not to add rapidfuzz as a dependency, but would instead prefer to directly include one edit distance / similarity implementation in spacy (possibly from rapidfuzz, or reimplemented ourselves if that makes more sense)

You can directly build against the C++ library https://github.com/maxbachmann/rapidfuzz-cpp.

@kwhumphreys
Contributor Author

> • it's possible there would be no way to support this in combination with set operators

@adrianeboyd I just spent some time adding support for set operators; see https://github.com/explosion/spaCy/pull/11359/files#diff-d7740e9e8c9929c0b0e8e962c46a22e09263ef4891faceaa15415b2be94a5837R221-R263. There probably isn't any way, though, to use different fuzziness parameters for different set members.
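
The limitation can be seen in a small pure-Python sketch of fuzzy set membership: a single edit budget is shared by every member of the set. This is not the code from the linked diff; `edit_distance` and `fuzzy_in` are hypothetical names used only here.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def edit_distance(a: str, b: str) -> int:
    """Recursive Levenshtein distance; fine for short, token-sized strings."""
    if not a:
        return len(b)
    if not b:
        return len(a)
    cost = 0 if a[-1] == b[-1] else 1
    return min(
        edit_distance(a[:-1], b) + 1,            # delete last char of a
        edit_distance(a, b[:-1]) + 1,            # insert last char of b
        edit_distance(a[:-1], b[:-1]) + cost,    # substitute or keep
    )


def fuzzy_in(token_text: str, candidates, max_edits: int) -> bool:
    """True if the token is within max_edits of any candidate.

    The same max_edits applies to every candidate, which mirrors the point
    that set members cannot each carry their own fuzziness.
    """
    return any(edit_distance(token_text, c) <= max_edits for c in candidates)


members = ["string1", "string2"]
print(fuzzy_in("strng1", members, 2))
print(fuzzy_in("banana", members, 2))
```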

@kwhumphreys
Contributor Author

kwhumphreys commented Aug 29, 2022

> • we're still not quite sure how to implement a fuzziness parameter for the operator; my initial clunky idea would be to support a small range of hard-coded operators like FUZZY75, FUZZY80, etc. but we're still thinking about the options/details here (ideas welcome!)

FUZZY1, FUZZY2, ... FUZZYN where N is the allowed Levenshtein string edit distance might be simpler than a 0-100 scale?
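
One way to see the difference between the two scales: a single typo produces very different percentage scores depending on word length, while it is always exactly one edit. The sketch below uses `difflib.SequenceMatcher` from the standard library as a stand-in 0-100 score; that is an assumption for illustration and is not how rapidfuzz computes its ratios.

```python
from difflib import SequenceMatcher


def similarity_0_100(a: str, b: str) -> float:
    """A stand-in 0-100 similarity score based on difflib's ratio."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


# Each pair differs by exactly one substituted character.
pairs = [("cat", "cut"), ("international", "internationel")]
threshold = 85  # the example threshold value used earlier in this thread

for left, right in pairs:
    score = similarity_0_100(left, right)
    verdict = "accepted" if score >= threshold else "rejected"
    print(f"{left!r} vs {right!r}: score {score:.1f}, {verdict} at {threshold}")
```

With an edit-count operator, both pairs would be treated the same way, since each is one edit apart regardless of length.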

@adrianeboyd
Contributor

> FUZZY1, FUZZY2, ... FUZZYN where N is the allowed Levenshtein string edit distance might be simpler than a 0-100 scale?

I like that idea! It is definitely concrete and easy for users to understand.

@kwhumphreys
Contributor Author

@adrianeboyd a small, fast, MIT-licensed implementation of Levenshtein distance here https://github.com/roy-ht/editdistance
Would this be acceptable to either import or copy into spaCy?

@adrianeboyd
Contributor

From looking at examples for the docs, I think my proposed default of 20% (floor) is way too low. It means that it allows 2 edits for any words from 1-14 characters, which isn't much of a useful range.
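
The arithmetic behind that observation can be checked with a short sketch. The 20% floor is taken from the comment; the minimum of 2 edits is inferred from "allows 2 edits for any words from 1-14 characters", and the function name `default_max_edits` is hypothetical.

```python
def default_max_edits(word: str, percent: int = 20, minimum: int = 2) -> int:
    """Allowed edits as floor(percent% of the word length), never below `minimum`.

    Integer arithmetic avoids floating-point rounding in the floor.
    """
    return max(minimum, len(word) * percent // 100)


# Print the allowance for word lengths 1 through 16.
for length in range(1, 17):
    print(length, default_max_edits("x" * length))
```

Under these assumptions the allowance stays at 2 until the word reaches 15 characters, which is the narrow range the comment describes.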

@adrianeboyd
Contributor

It also seemed kind of odd to just have 1..5, so I added 6..9, which gives you a total of ten buckets to use for some very custom fuzzy_compare method.

@kwhumphreys
Contributor Author

> From looking at examples for the docs, I think my proposed default of 20% (floor) is way too low. It means that it allows 2 edits for any words from 1-14 characters, which isn't much of a useful range.

My original function was trying to implement word length - 2 up to a maximum of 5. I don't think I've found any English word pairs where >5 edits is a useful match.
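
Read literally, "word length - 2 up to a maximum of 5" could be sketched as below. Clamping at zero for very short words is an assumption added here, since the comment does not say what happens for one-character words, and the function name is hypothetical.

```python
def length_based_max_edits(word: str, slack: int = 2, cap: int = 5) -> int:
    """Allowed edits: word length minus `slack`, capped at `cap`, never negative."""
    return max(0, min(cap, len(word) - slack))


for word in ["a", "at", "cat", "fuzzy", "matching", "international"]:
    print(word, length_based_max_edits(word))
```

Unlike a flat percentage, this rule grows quickly for short words and then stops at the cap, so long words never get more than five edits.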

@adrianeboyd
Contributor

Having spacy.matcher.matcher.fuzzy_compare in the docs is kind of ugly, but I'm not sure we really want to intentionally export this higher. Unlike levenshtein, there's no particular reason for people to use this independently of the matcher.

@svlandeg svlandeg added the v3.5 (Related to v3.5) label Jan 2, 2023
Member

@svlandeg svlandeg left a comment

This is looking great - thanks @kwhumphreys and @adrianeboyd for your work on this!

I had a few comments & suggestions, but overall it'll be great to have this in spaCy 3.5 🙂

spacy/matcher/matcher.pyx (outdated, resolved)
spacy/matcher/matcher.pyi (outdated, resolved)
spacy/matcher/matcher.pyx (resolved)
spacy/matcher/matcher.pyx (resolved)
spacy/matcher/matcher.pyx (outdated, resolved)
spacy/matcher/matcher.pyx (outdated, resolved)
spacy/schemas.py (outdated, resolved)
spacy/tests/matcher/test_matcher_api.py (outdated, resolved)
@svlandeg svlandeg removed the ⚠️ wip (Work in progress) label Jan 2, 2023
@adrianeboyd adrianeboyd changed the title from "WIP: enable fuzzy text matching in Matcher" to "Enable fuzzy text matching in Matcher" Jan 9, 2023
@svlandeg svlandeg merged commit 19650eb into explosion:master Jan 10, 2023
@kwhumphreys
Contributor Author

@adrianeboyd note this usage, which I just came across today: https://www.elastic.co/guide/en/elasticsearch/reference/current/common-options.html#fuzziness
it uses low/high token lengths, similarly to the min/max in the original fuzzy_compare function.

@shadeMe shadeMe mentioned this pull request Jan 16, 2023
3 tasks
Labels
enhancement (Feature requests and improvements), feat / matcher (Feature: Token, phrase and dependency matcher), v3.5 (Related to v3.5)
5 participants