Enable fuzzy text matching in Matcher #11359

kwhumphreys · 2022-08-22T15:09:36Z

Description

work in progress:

- Use Levenshtein distance (polyleven implementation) to allow fuzzy matching of specific tokens in Matcher.
- The FUZZY1, FUZZY2, ... FUZZY9 operators specify the maximum string edit distance allowed for a match.
- Operators can apply to single tokens or sets, e.g. {"LOWER": {"FUZZY3": "string"}} or {"LOWER": {"FUZZY2": {"IN": ["string1", "string2"]}}}
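
For readers unfamiliar with the metric, the check behind a FUZZYn operator can be sketched in pure Python. This is only an illustration of plain Levenshtein distance, not the polyleven code the PR uses; `levenshtein` and `fuzzy_match` are hypothetical names chosen for this sketch.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep rows short
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ch_a
                    current[j - 1] + 1,      # insert ch_b
                    previous[j - 1] + cost,  # substitute (or keep)
                )
            )
        previous = current
    return previous[-1]


def fuzzy_match(token_text: str, pattern_text: str, max_edits: int) -> bool:
    """Hypothetical stand-in for what a FUZZYn operator decides."""
    return levenshtein(token_text, pattern_text) <= max_edits
```

Note that plain Levenshtein distance counts a swapped pair of adjacent characters as two edits, so `fuzzy_match("strnig", "string", 1)` is false while the same call with `max_edits=2` is true.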

Types of change

enhancement

Checklist

I confirm that I have the right to submit this contribution under the project's MIT license.
I ran the tests, and all new and existing tests passed.
My changes don't require a change to the documentation, or if they do, I've added all required information.

kwhumphreys · 2022-08-22T16:50:00Z

Any Cython advice on how to avoid `with gil` at https://github.com/explosion/spaCy/pull/11359/files#diff-63f1a25e13e4c412a4decf97e30bf2157f1a0d800efb1d0b25679e1991a8b804R697 ?

adrianeboyd · 2022-08-22T18:45:37Z

Just a few initial notes:

- I think there are a lot of users who would like to have something like this, so this is a nice idea!
- we're pretty conservative about adding new dependencies to spacy, so we'll have to seriously consider whether we want to add rapidfuzz
- this feels like it ought to be an extra operator (similar to REGEX) rather than being a Matcher setting that's applied to a hard-coded subset of attributes
```
pattern = [{"ORTH": {"FUZZY": "word"}}]
```
Then you could apply it to any attributes that you want including custom extensions. What's trickier with the operator is a parameter for fuzziness.

kwhumphreys · 2022-08-22T19:02:56Z

Thanks @adrianeboyd. This plugin uses the notation you suggest https://spacy.io/universe/project/spaczz but in my evaluation it is much too slow to be practical, even with minimal use of the operator. Therefore I was trying for a more direct integration with the underlying rapidfuzz package.

Agreed that a user-specified list of attributes would be better than a hard-coded list, but I'd like to be able to specify this on Matcher init rather than on individual tokens. My use case is a large existing set of patterns which I'd like to optionally apply with fuzzy matching, without having to modify every pattern (but perhaps this could be done in a pattern pre-processing stage).
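
The pattern pre-processing stage mentioned in this comment could be quite small. The sketch below is only an assumption about how such a step might look: `add_fuzzy` is a hypothetical helper, and it wraps plain string values of the chosen attributes in the `FUZZY` operator suggested earlier in the thread, leaving everything else untouched.

```python
from copy import deepcopy
from typing import Any, Dict, Iterable, List

TokenPattern = Dict[str, Any]


def add_fuzzy(
    pattern: List[TokenPattern], attrs: Iterable[str], op: str = "FUZZY"
) -> List[TokenPattern]:
    """Return a copy of `pattern` with string values of `attrs` wrapped in `op`."""
    fuzzy_attrs = set(attrs)
    result = []
    for token in pattern:
        new_token = {}
        for key, value in token.items():
            if key in fuzzy_attrs and isinstance(value, str):
                # {"LOWER": "string"} -> {"LOWER": {"FUZZY": "string"}}
                new_token[key] = {op: value}
            else:
                # Existing operators, booleans, etc. are copied unchanged.
                new_token[key] = deepcopy(value)
        result.append(new_token)
    return result
```

For example, `add_fuzzy([{"LOWER": "string"}, {"IS_PUNCT": True}], ["ORTH", "LOWER"])` rewrites only the first token, so an existing pattern set can be converted once at load time without editing the patterns by hand.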

adrianeboyd · 2022-08-23T07:49:57Z

Have you tried implementing FUZZY within spacy's Matcher similar to REGEX? I think that spaczz has some of its own Python-only matcher implementation, so it's not just the extra operator that's causing the difference.

If you want to get a general idea of the speed difference just for the operator, you can try comparing {"ORTH": "the"} to {"ORTH": {"REGEX": "^the$"}}.

kwhumphreys · 2022-08-25T23:37:26Z

added a FUZZY attribute, as suggested (and as compatible with spaczz), but using a fixed fuzzy threshold value specified on init.

also added a parameter to specify a list of attributes on init, to be treated as fuzzy in all patterns, e.g. Matcher(en_vocab, fuzzy=85, fuzzy_attrs=["ORTH", "LOWER"]).

TODO: handle extension attributes
TODO: handle values in sets

adrianeboyd · 2022-08-29T09:36:05Z

We think this is a good idea and we discussed a bit internally how we'd like to support it:

- we would prefer not to add rapidfuzz as a dependency, but would instead prefer to directly include one edit distance / similarity implementation in spacy (possibly from rapidfuzz, or reimplemented ourselves if that makes more sense)
- we don't think that a top-level fuzzy setting for Matcher or EntityRuler makes sense because the appropriate fuzziness often depends so much on the exact strings / pattern / task and different patterns in the same Matcher would need different levels of fuzziness
- we're still not quite sure how to implement a fuzziness parameter for the operator; my initial clunky idea would be to support a small range of hard-coded operators like FUZZY75, FUZZY80, etc. but we're still thinking about the options/details here (ideas welcome!)
- it's possible there would be no way to support this in combination with set operators

maxbachmann · 2022-08-29T09:46:24Z

> we would prefer not to add rapidfuzz as a dependency, but would instead prefer to directly include one edit distance / similarity implementation in spacy (possibly from rapidfuzz, or reimplemented ourselves if that makes more sense)

You can directly build against the C++ library https://github.com/maxbachmann/rapidfuzz-cpp.

kwhumphreys · 2022-08-29T10:55:19Z

> it's possible there would be no way to support this in combination with set operators

@adrianeboyd I just spent some time adding support for set operators. see https://github.com/explosion/spaCy/pull/11359/files#diff-d7740e9e8c9929c0b0e8e962c46a22e09263ef4891faceaa15415b2be94a5837R221-R263 but there probably isn't any way to use different fuzziness parameters for different set members.
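
As a rough illustration of the limitation described here, a fuzzy set check can be sketched as "any member within the edit budget", with one budget shared by every member. The function names are hypothetical and the distance function is a simple memoised recursion, not the code in the linked diff.

```python
from functools import lru_cache
from typing import Iterable


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance via memoised recursion (fine for short tokens)."""

    @lru_cache(maxsize=None)
    def dist(i: int, j: int) -> int:
        if i == 0:
            return j  # insert the remaining j characters
        if j == 0:
            return i  # delete the remaining i characters
        substitution = dist(i - 1, j - 1) + (a[i - 1] != b[j - 1])
        return min(dist(i - 1, j) + 1, dist(i, j - 1) + 1, substitution)

    return dist(len(a), len(b))


def fuzzy_in(value: str, members: Iterable[str], max_edits: int) -> bool:
    """Sketch of {"FUZZYn": {"IN": [...]}}: one shared budget for all members."""
    return any(edit_distance(value, member) <= max_edits for member in members)


def fuzzy_not_in(value: str, members: Iterable[str], max_edits: int) -> bool:
    """Sketch of the NOT_IN counterpart: no member may be within the budget."""
    return not fuzzy_in(value, members, max_edits)
```

Because `max_edits` is an argument of the whole check rather than of each member, there is no place in this shape to give `"string1"` a different tolerance from `"string2"`.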

kwhumphreys · 2022-08-29T11:06:29Z

> we're still not quite sure how to implement a fuzziness parameter for the operator; my initial clunky idea would be to support a small range of hard-coded operators like FUZZY75, FUZZY80, etc. but we're still thinking about the options/details here (ideas welcome!)

FUZZY1, FUZZY2, ... FUZZYN where N is the allowed Levenshtein string edit distance might be simpler than a 0-100 scale?

adrianeboyd · 2022-08-29T11:38:44Z

> FUZZY1, FUZZY2, ... FUZZYN where N is the allowed Levenshtein string edit distance might be simpler than a 0-100 scale?

I like that idea! It is definitely concrete and easy for users to understand.
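
One way to picture the FUZZYN naming scheme is as a tiny parser from operator name to edit budget. The sketch below is an assumption for illustration only: `parse_fuzzy_operator` is a hypothetical helper, the 1-9 range follows the PR description, and returning -1 for a bare `FUZZY` (meaning "use a default") is a convention chosen here.

```python
import re
from typing import Optional

# FUZZY optionally followed by a single digit 1-9.
_FUZZY_OP = re.compile(r"^FUZZY([1-9])?$")


def parse_fuzzy_operator(name: str) -> Optional[int]:
    """Return the max edit distance encoded in `name`.

    -1 means a bare FUZZY (caller picks a default); None means `name`
    is not a fuzzy operator at all.
    """
    match = _FUZZY_OP.match(name)
    if match is None:
        return None
    digit = match.group(1)
    return int(digit) if digit is not None else -1
```

Compared with a 0-100 similarity scale, the result is directly the number of character edits a user is willing to accept.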

kwhumphreys · 2022-08-29T14:03:30Z

@adrianeboyd there is a small, fast, MIT-licensed implementation of Levenshtein distance here: https://github.com/roy-ht/editdistance
Would it be acceptable to either import this or copy it into spaCy?

adrianeboyd · 2022-12-02T07:58:54Z

From looking at examples for the docs, I think my proposed default of 20% (floor) is way too low. It means that it allows 2 edits for any words from 1-14 characters, which isn't much of a useful range.
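
The arithmetic behind this observation can be checked with a short sketch. It assumes the default is "at least 2 edits, otherwise 20% of the pattern length rounded down", which is what the 1-14 character range implies; the function names are hypothetical. The second function shows the "rounded 30% of pattern length" variant named in a later commit of this PR, again assuming the same minimum of 2.

```python
def default_max_edits_floor20(pattern_length: int) -> int:
    """Assumed rule: at least 2 edits, else floor(20% of the length)."""
    return max(2, pattern_length * 20 // 100)


def default_max_edits_round30(pattern_length: int) -> int:
    """Assumed rule: at least 2 edits, else round(30% of the length)."""
    return max(2, round(0.3 * pattern_length))


if __name__ == "__main__":
    # Print both budgets side by side for a range of pattern lengths.
    for n in (1, 5, 8, 9, 12, 14, 15):
        print(n, default_max_edits_floor20(n), default_max_edits_round30(n))
```

Under the 20% rule the budget stays at 2 until the pattern reaches 15 characters, whereas the 30% rule already allows a third edit at 9 characters.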

adrianeboyd · 2022-12-02T08:05:34Z

It also seemed kind of odd to just have 1..5, so I added 6..9, which gives you a total of ten buckets to use for some very custom fuzzy_compare method.
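
To illustrate what a "very custom" comparison might do with ten buckets, here is a sketch that ignores edit distance entirely and treats the bucket as a similarity threshold. The `(input_text, pattern_text, fuzzy)` signature, the use of -1 for a bare FUZZY, and the threshold mapping are all assumptions made for this example, not something stated in the thread.

```python
from difflib import SequenceMatcher


def ratio_fuzzy_compare(input_text: str, pattern_text: str, fuzzy: int = -1) -> bool:
    """Hypothetical custom comparison: buckets select a similarity threshold.

    fuzzy == -1 (bare FUZZY) -> threshold 0.8
    fuzzy == n (FUZZY1..9)   -> threshold 1 - n / 10
    """
    threshold = 0.8 if fuzzy < 0 else 1 - fuzzy / 10
    similarity = SequenceMatcher(None, input_text, pattern_text).ratio()
    return similarity >= threshold
```

With this mapping FUZZY1 is the strictest bucket and FUZZY9 the loosest, so the operator names keep their "higher number means fuzzier" reading even though the underlying measure is different.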

kwhumphreys · 2022-12-02T17:37:16Z

> From looking at examples for the docs, I think my proposed default of 20% (floor) is way too low. It means that it allows 2 edits for any words from 1-14 characters, which isn't much of a useful range.

My original function was trying to implement word length - 2 up to a maximum of 5. I don't think I've found any English word pairs where >5 edits is a useful match.
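
Read literally, that rule could be sketched as below. Which length is meant (input, pattern, or the shorter of the two) is not stated here, so the single `length` argument and the function name are assumptions.

```python
def length_based_max_edits(length: int, cap: int = 5) -> int:
    """Allow (length - 2) edits, never negative and never more than `cap`."""
    return max(0, min(cap, length - 2))
```

Under this reading a 3-character word tolerates one edit, a 7-character word already reaches the cap of 5, and words of one or two characters must match exactly.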

adrianeboyd · 2022-12-21T08:47:23Z

Having spacy.matcher.matcher.fuzzy_compare in the docs is kind of ugly, but I'm not sure we really want to intentionally export this higher. Unlike levenshtein, there's no particular reason for people to use this independently of the matcher.

svlandeg

This is looking great - thanks @kwhumphreys and @adrianeboyd for your work on this!

I had a few comments & suggestions, but overall it'll be great to have this in spaCy 3.5 🙂

kwhumphreys · 2023-01-13T00:04:45Z

@adrianeboyd note this usage, which I just came across today: https://www.elastic.co/guide/en/elasticsearch/reference/current/common-options.html#fuzziness
it uses low/high token lengths, similarly to the min/max in the original fuzzy_compare function.
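
For comparison, the Elasticsearch `AUTO:low,high` scheme referenced here can be sketched as follows. To the best of my understanding of the linked docs, terms shorter than `low` must match exactly, terms shorter than `high` allow one edit, and longer terms allow two, with defaults of 3 and 6; the function name is hypothetical.

```python
def auto_fuzziness(term_length: int, low: int = 3, high: int = 6) -> int:
    """Edit budget by term length, in the style of Elasticsearch AUTO:low,high."""
    if term_length < low:
        return 0  # very short terms must match exactly
    if term_length < high:
        return 1  # medium terms allow one edit
    return 2      # longer terms allow two edits
```

Unlike a percentage of the pattern length, this scheme tops out at two edits no matter how long the term is.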

enable fuzzy matching

1f2e57e

shadeMe added the enhancement, ⚠️ wip, and feat / matcher labels Aug 22, 2022

maxbachmann reviewed Aug 22, 2022

spacy/matcher/matcher.pyx (outdated)

Kevin Humphreys added 3 commits August 24, 2022 13:13

add fuzzy param to EntityMatcher

b617382

include rapidfuzz_capi

ee985a3

not yet used

fix type

9600fe1

maxbachmann reviewed Aug 24, 2022

setup.cfg (outdated)

Kevin Humphreys added 3 commits August 24, 2022 17:54

add FUZZY predicate

3dc5b9c

add fuzzy attribute list

78699ab

fix type properly

c017de9

tidying

c033948

kwhumphreys mentioned this pull request Aug 26, 2022

Speed up the detection process gandersen101/spaczz#20

Open

maxbachmann reviewed Aug 26, 2022

pyproject.toml (outdated)

Kevin Humphreys added 2 commits August 29, 2022 10:58

remove unnecessary dependency

b189f25

handle fuzzy sets

9bdccf9

simplify fuzzy sets

ecebb5b

case fix

ecd0455

adrianeboyd added 3 commits December 1, 2022 17:52

Fix predicate keys and matching for SetPredicate with FUZZY and REGEX

0e2c284

Add FUZZY6..9

27a4925

Add initial docs

45675e1

kwhumphreys commented Dec 2, 2022

spacy/matcher/matcher.pyx

Kevin Humphreys and others added 6 commits December 7, 2022 16:13

Merge branch 'explosion:master' into rapidfuzz

b690c91

Increase default fuzzy to rounded 30% of pattern length

e88c724

Merge remote-tracking branch 'upstream/master' into rapidfuzz

bac3a08

Update docs for fuzzy_compare in components

eb65c43

Update EntityRuler and SpanRuler API docs

96a786d

Rename EntityRuler and SpanRuler setting to matcher_fuzzy_compare

903c4af

To have naming similar to `phrase_matcher_attr`, rename the `fuzzy_compare` setting for `EntityRuler` and `SpanRuler` to `matcher_fuzzy_compare`. Organize it next to `phrase_matcher_attr` in the docs.

svlandeg added the v3.5 Related to v3.5 label Jan 2, 2023

svlandeg reviewed Jan 2, 2023

spacy/tests/matcher/test_matcher_api.py (outdated)

svlandeg removed the ⚠️ wip Work in progress label Jan 2, 2023

adrianeboyd and others added 6 commits January 9, 2023 12:57

Fix schema aliases

9042c46

Co-authored-by: Sofie Van Landeghem

Fix typo

8722f85

Co-authored-by: Sofie Van Landeghem

Add FUZZY6-9 operators and update tests

213fb85

Parameterize test over greedy

8ee6551

Co-authored-by: Sofie Van Landeghem

Fix type for fuzzy_compare to remove Optional

0d60744

Rename to spacy.levenshtein_compare.v1, move to spacy.matcher.levenshtein

e0abb55

adrianeboyd changed the title ~~WIP: enable fuzzy text matching in Matcher~~ Enable fuzzy text matching in Matcher Jan 9, 2023

svlandeg reviewed Jan 9, 2023

website/docs/api/entityruler.md (outdated)

Update docs following levenshtein_compare renaming

92aca94

svlandeg merged commit 19650eb into explosion:master Jan 10, 2023

shadeMe mentioned this pull request Jan 16, 2023

3.5 usage page #12057

Merged

3 tasks
