Enable fuzzy text matching in Matcher #11359

Merged: 87 commits (Jan 10, 2023)

Commits
1f2e57e
enable fuzzy matching
Aug 22, 2022
b617382
add fuzzy param to EntityMatcher
Aug 24, 2022
ee985a3
include rapidfuzz_capi
Aug 24, 2022
9600fe1
fix type
Aug 24, 2022
3dc5b9c
add FUZZY predicate
Aug 24, 2022
78699ab
add fuzzy attribute list
Aug 25, 2022
c017de9
fix type properly
Aug 25, 2022
c033948
tidying
Aug 26, 2022
b189f25
remove unnecessary dependency
Aug 29, 2022
9bdccf9
handle fuzzy sets
Aug 29, 2022
ecebb5b
simplify fuzzy sets
Aug 29, 2022
ecd0455
case fix
Aug 29, 2022
43948f7
switch to FUZZYn predicates
Aug 29, 2022
a8a4d86
revert changes added for fuzzy param
Aug 29, 2022
59021f7
switch to polyleven
Aug 29, 2022
dacfb57
enable fuzzy matching
Aug 22, 2022
66e9fdd
add fuzzy param to EntityMatcher
Aug 24, 2022
3a63ad1
include rapidfuzz_capi
Aug 24, 2022
426f334
fix type
Aug 24, 2022
594674d
add FUZZY predicate
Aug 24, 2022
63f5e13
add fuzzy attribute list
Aug 25, 2022
3dba984
fix type properly
Aug 25, 2022
ee25d43
tidying
Aug 26, 2022
0859e39
remove unnecessary dependency
Aug 29, 2022
9c0f936
handle fuzzy sets
Aug 29, 2022
e636f49
simplify fuzzy sets
Aug 29, 2022
974e5f9
case fix
Aug 29, 2022
3591a69
switch to FUZZYn predicates
Aug 29, 2022
568a843
revert changes added for fuzzy param
Aug 29, 2022
a6d26a0
switch to polyleven
Aug 29, 2022
b7599df
fuzzy match only on oov tokens
Sep 14, 2022
b393525
Merge branch 'rapidfuzz' of https://github.com/kwhumphreys/spaCy into…
Sep 14, 2022
711f16c
Merge branch 'master' into rapidfuzz
Sep 15, 2022
a1c9840
remove polyleven
Sep 15, 2022
252e9ab
exclude whitespace tokens
Sep 15, 2022
4a677ac
don't allow more edits than characters
Sep 15, 2022
eab96f7
fix min distance
Sep 22, 2022
0da324a
reinstate FUZZY operator
Sep 23, 2022
bf4b353
handle sets inside regex operator
Sep 28, 2022
9c83b80
remove is_oov check
Oct 14, 2022
b2f7dbe
attempt build fix
Oct 17, 2022
ab28d68
re-attempt build fix
Oct 17, 2022
920835e
Merge branch 'explosion:master' into rapidfuzz
Oct 17, 2022
ed8cb55
don't overwrite fuzzy param value
Oct 18, 2022
9d6cd2f
Merge branch 'rapidfuzz' of https://github.com/kwhumphreys/spaCy into…
Oct 18, 2022
1bfbd29
Merge branch 'explosion:master' into rapidfuzz
Oct 28, 2022
6e64a5c
move fuzzy_match
Oct 31, 2022
49e9317
move fuzzy_match back inside Matcher
Nov 3, 2022
feb068a
Format tests
adrianeboyd Nov 11, 2022
070cbf2
Parametrize fuzzyn tests
adrianeboyd Nov 11, 2022
d677335
Parametrize and merge fuzzy+set tests
adrianeboyd Nov 11, 2022
7e25c7f
Format
adrianeboyd Nov 11, 2022
6ae4c99
Move fuzzy_match to a standalone method
adrianeboyd Nov 11, 2022
c81330d
Change regex kwarg type to bool
adrianeboyd Nov 11, 2022
ae0bb75
Add types for fuzzy_match
adrianeboyd Nov 14, 2022
6a9fec7
Parametrize fuzzyn+set tests
adrianeboyd Nov 12, 2022
183b13d
Minor refactoring for fuzz/fuzzy
adrianeboyd Nov 14, 2022
7d10609
Make fuzzy_match a Matcher kwarg
adrianeboyd Nov 14, 2022
a1f87dc
Update type for _default_fuzzy_match
adrianeboyd Nov 14, 2022
efa638a
don't overwrite function param
Nov 24, 2022
c3f446f
Rename to fuzzy_compare
adrianeboyd Nov 28, 2022
bde46b7
Update fuzzy_compare default argument declarations
adrianeboyd Nov 28, 2022
5088949
Merge remote-tracking branch 'upstream/master' into rapidfuzz
adrianeboyd Nov 28, 2022
e029616
allow fuzzy_compare override from EntityRuler
Nov 29, 2022
2243803
define new Matcher keyword arg
Nov 29, 2022
561adac
fix type definition
Nov 29, 2022
d1628df
Implement fuzzy_compare config option for EntityRuler and SpanRuler
adrianeboyd Nov 29, 2022
3c6dc10
Rename _default_fuzzy_compare to fuzzy_compare, remove from reexporte…
adrianeboyd Nov 29, 2022
9fc37d4
Use simpler fuzzy_compare algorithm
adrianeboyd Nov 29, 2022
7a27fa7
Update types
adrianeboyd Nov 29, 2022
8a749fc
Increase minimum to 2 in fuzzy_compare to allow one transposition
adrianeboyd Nov 30, 2022
0e2c284
Fix predicate keys and matching for SetPredicate with FUZZY and REGEX
adrianeboyd Dec 1, 2022
27a4925
Add FUZZY6..9
adrianeboyd Dec 2, 2022
45675e1
Add initial docs
adrianeboyd Dec 2, 2022
b690c91
Merge branch 'explosion:master' into rapidfuzz
Dec 8, 2022
e88c724
Increase default fuzzy to rounded 30% of pattern length
adrianeboyd Dec 8, 2022
bac3a08
Merge remote-tracking branch 'upstream/master' into rapidfuzz
adrianeboyd Dec 19, 2022
eb65c43
Update docs for fuzzy_compare in components
adrianeboyd Dec 19, 2022
96a786d
Update EntityRuler and SpanRuler API docs
adrianeboyd Dec 19, 2022
903c4af
Rename EntityRuler and SpanRuler setting to matcher_fuzzy_compare
adrianeboyd Dec 21, 2022
9042c46
Fix schema aliases
adrianeboyd Jan 9, 2023
8722f85
Fix typo
adrianeboyd Jan 9, 2023
213fb85
Add FUZZY6-9 operators and update tests
adrianeboyd Jan 9, 2023
8ee6551
Parameterize test over greedy
adrianeboyd Jan 9, 2023
0d60744
Fix type for fuzzy_compare to remove Optional
adrianeboyd Jan 9, 2023
e0abb55
Rename to spacy.levenshtein_compare.v1, move to spacy.matcher.levensh…
adrianeboyd Jan 9, 2023
92aca94
Update docs following levenshtein_compare renaming
adrianeboyd Jan 9, 2023
17 changes: 17 additions & 0 deletions spacy/matcher/levenshtein.pyx
@@ -4,6 +4,8 @@ from libc.stdint cimport int64_t

from typing import Optional

from ..util import registry


cdef extern from "polyleven.c":
int64_t polyleven(PyObject *o1, PyObject *o2, int64_t k)
@@ -13,3 +15,18 @@ cpdef int64_t levenshtein(a: str, b: str, k: Optional[int] = None):
if k is None:
k = -1
return polyleven(<PyObject*>a, <PyObject*>b, k)


cpdef bint levenshtein_compare(input_text: str, pattern_text: str, fuzzy: int = -1):
if fuzzy >= 0:
max_edits = fuzzy
else:
# allow at least two edits (to allow at least one transposition) and up
# to 30% of the pattern string length
max_edits = max(2, round(0.3 * len(pattern_text)))
return levenshtein(input_text, pattern_text, max_edits) <= max_edits


@registry.misc("spacy.levenshtein_compare.v1")
def make_levenshtein_compare():
return levenshtein_compare
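The default edit budget above (at least two edits, up to a rounded 30% of the pattern length) can be restated in plain Python. This is a sketch, not spaCy's actual implementation: `levenshtein` here is a textbook dynamic-programming edit distance standing in for the C `polyleven` call, which additionally takes a cutoff `k` for early exit.

```python
def levenshtein(a: str, b: str) -> int:
    # Plain DP edit distance, standing in for the polyleven C function.
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def levenshtein_compare(input_text: str, pattern_text: str, fuzzy: int = -1) -> bool:
    if fuzzy >= 0:
        max_edits = fuzzy
    else:
        # at least two edits (to allow one transposition), up to ~30% of pattern length
        max_edits = max(2, round(0.3 * len(pattern_text)))
    return levenshtein(input_text, pattern_text) <= max_edits
```

For a 10-character pattern like `"definitely"`, the default budget is `max(2, round(3.0)) = 3` edits, so the misspelling `"definitly"` (one deletion away) matches.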
1 change: 1 addition & 0 deletions spacy/matcher/matcher.pxd
@@ -77,3 +77,4 @@ cdef class Matcher:
cdef public object _extensions
cdef public object _extra_predicates
cdef public object _seen_attrs
cdef public object _fuzzy_compare
3 changes: 2 additions & 1 deletion spacy/matcher/matcher.pyi
@@ -5,7 +5,8 @@ from ..vocab import Vocab
from ..tokens import Doc, Span

class Matcher:
def __init__(self, vocab: Vocab, validate: bool = ...) -> None: ...
def __init__(self, vocab: Vocab, validate: bool = ...,
fuzzy_compare: Callable[[str, str, int], bool] = ...) -> None: ...
def __reduce__(self) -> Any: ...
def __len__(self) -> int: ...
def __contains__(self, key: str) -> bool: ...
170 changes: 135 additions & 35 deletions spacy/matcher/matcher.pyx
@@ -1,4 +1,4 @@
# cython: infer_types=True, profile=True
# cython: binding=True, infer_types=True, profile=True
from typing import List, Iterable

from libcpp.vector cimport vector
@@ -20,10 +20,12 @@ from ..tokens.token cimport Token
from ..tokens.morphanalysis cimport MorphAnalysis
from ..attrs cimport ID, attr_id_t, NULL_ATTR, ORTH, POS, TAG, DEP, LEMMA, MORPH, ENT_IOB

from .levenshtein import levenshtein_compare
from ..schemas import validate_token_pattern
from ..errors import Errors, MatchPatternError, Warnings
from ..strings import get_string_id
from ..attrs import IDS
from ..util import registry


DEF PADDING = 5
@@ -36,11 +38,13 @@ cdef class Matcher:
USAGE: https://spacy.io/usage/rule-based-matching
"""

def __init__(self, vocab, validate=True):
def __init__(self, vocab, validate=True, *, fuzzy_compare=levenshtein_compare):
"""Create the Matcher.

vocab (Vocab): The vocabulary object, which must be shared with the
documents the matcher will operate on.
validate (bool): Validate all patterns added to this matcher.
fuzzy_compare (Callable[[str, str, int], bool]): The comparison method
for the FUZZY operators.
"""
self._extra_predicates = []
self._patterns = {}
@@ -51,9 +55,10 @@ cdef class Matcher:
self.vocab = vocab
self.mem = Pool()
self.validate = validate
self._fuzzy_compare = fuzzy_compare

def __reduce__(self):
data = (self.vocab, self._patterns, self._callbacks)
data = (self.vocab, self._patterns, self._callbacks, self.validate, self._fuzzy_compare)
return (unpickle_matcher, data, None, None)

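The `__reduce__` change above is what lets a customized matcher survive pickling: the tuple now carries `validate` and `fuzzy_compare` so `unpickle_matcher` can rebuild the object with the same settings. A minimal stand-alone sketch of that pattern, with simplified names and no real spaCy objects (the real `unpickle_matcher` re-adds patterns via `Matcher.add`):

```python
class Matcher:
    """Toy stand-in illustrating the __reduce__ pattern from the diff."""
    def __init__(self, vocab, validate=True, *, fuzzy_compare=None):
        self.vocab = vocab
        self.validate = validate
        self.fuzzy_compare = fuzzy_compare
        self.patterns = {}

    def __reduce__(self):
        # Carry validate and fuzzy_compare so the rebuilt object keeps its settings.
        data = (self.vocab, self.patterns, self.validate, self.fuzzy_compare)
        return (unpickle_matcher, data, None, None)

def unpickle_matcher(vocab, patterns, validate, fuzzy_compare):
    matcher = Matcher(vocab, validate=validate, fuzzy_compare=fuzzy_compare)
    matcher.patterns = dict(patterns)  # simplified: real code re-adds patterns
    return matcher
```

With `pickle`, `pickle.loads(pickle.dumps(m))` exercises the same path, provided `fuzzy_compare` is a module-level (hence picklable) function.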
def __len__(self):
@@ -128,7 +133,7 @@ cdef class Matcher:
for pattern in patterns:
try:
specs = _preprocess_pattern(pattern, self.vocab,
self._extensions, self._extra_predicates)
self._extensions, self._extra_predicates, self._fuzzy_compare)
self.patterns.push_back(init_pattern(self.mem, key, specs))
for spec in specs:
for attr, _ in spec[1]:
@@ -326,8 +331,8 @@ cdef class Matcher:
return key


def unpickle_matcher(vocab, patterns, callbacks):
matcher = Matcher(vocab)
def unpickle_matcher(vocab, patterns, callbacks, validate, fuzzy_compare):
matcher = Matcher(vocab, validate=validate, fuzzy_compare=fuzzy_compare)
for key, pattern in patterns.items():
callback = callbacks.get(key, None)
matcher.add(key, pattern, on_match=callback)
@@ -754,7 +759,7 @@ cdef attr_t get_ent_id(const TokenPatternC* pattern) nogil:
return id_attr.value


def _preprocess_pattern(token_specs, vocab, extensions_table, extra_predicates):
def _preprocess_pattern(token_specs, vocab, extensions_table, extra_predicates, fuzzy_compare):
"""This function interprets the pattern, converting the various bits of
syntactic sugar before we compile it into a struct with init_pattern.

@@ -781,7 +786,7 @@ def _preprocess_pattern(token_specs, vocab, extensions_table, extra_predicates):
ops = _get_operators(spec)
attr_values = _get_attr_values(spec, string_store)
extensions = _get_extensions(spec, string_store, extensions_table)
predicates = _get_extra_predicates(spec, extra_predicates, vocab)
predicates = _get_extra_predicates(spec, extra_predicates, vocab, fuzzy_compare)
for op in ops:
tokens.append((op, list(attr_values), list(extensions), list(predicates), token_idx))
return tokens
@@ -826,16 +831,45 @@ def _get_attr_values(spec, string_store):
# These predicate helper classes are used to match the REGEX, IN, >= etc
# extensions to the matcher introduced in #3173.

class _FuzzyPredicate:
operators = ("FUZZY", "FUZZY1", "FUZZY2", "FUZZY3", "FUZZY4", "FUZZY5",
"FUZZY6", "FUZZY7", "FUZZY8", "FUZZY9")

def __init__(self, i, attr, value, predicate, is_extension=False, vocab=None,
regex=False, fuzzy=None, fuzzy_compare=None):
self.i = i
self.attr = attr
self.value = value
self.predicate = predicate
self.is_extension = is_extension
if self.predicate not in self.operators:
raise ValueError(Errors.E126.format(good=self.operators, bad=self.predicate))
fuzz = self.predicate[len("FUZZY"):] # number after prefix
self.fuzzy = int(fuzz) if fuzz else -1
self.fuzzy_compare = fuzzy_compare
self.key = (self.attr, self.fuzzy, self.predicate, srsly.json_dumps(value, sort_keys=True))

def __call__(self, Token token):
if self.is_extension:
value = token._.get(self.attr)
else:
value = token.vocab.strings[get_token_attr_for_matcher(token.c, self.attr)]
if self.value == value:
return True
return self.fuzzy_compare(value, self.value, self.fuzzy)

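Two details of `_FuzzyPredicate` above are easy to miss: the operator name itself encodes the edit budget (a bare `FUZZY` becomes `-1`, meaning "use the default budget"), and `__call__` short-circuits on exact equality before paying for an edit-distance computation. A dependency-free restatement of both, with illustrative names:

```python
def parse_fuzzy_level(operator: str) -> int:
    # "FUZZY" -> -1 (default budget); "FUZZY1".."FUZZY9" -> explicit budget.
    suffix = operator[len("FUZZY"):]
    return int(suffix) if suffix else -1

def fuzzy_token_match(token_text: str, pattern_text: str, level: int, compare) -> bool:
    # Exact equality first, mirroring _FuzzyPredicate.__call__.
    if token_text == pattern_text:
        return True
    return compare(token_text, pattern_text, level)
```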

class _RegexPredicate:
operators = ("REGEX",)

def __init__(self, i, attr, value, predicate, is_extension=False, vocab=None):
def __init__(self, i, attr, value, predicate, is_extension=False, vocab=None,
regex=False, fuzzy=None, fuzzy_compare=None):
self.i = i
self.attr = attr
self.value = re.compile(value)
self.predicate = predicate
self.is_extension = is_extension
self.key = (attr, self.predicate, srsly.json_dumps(value, sort_keys=True))
self.key = (self.attr, self.predicate, srsly.json_dumps(value, sort_keys=True))
if self.predicate not in self.operators:
raise ValueError(Errors.E126.format(good=self.operators, bad=self.predicate))

@@ -850,18 +884,28 @@ class _RegexPredicate:
class _SetPredicate:
operators = ("IN", "NOT_IN", "IS_SUBSET", "IS_SUPERSET", "INTERSECTS")

def __init__(self, i, attr, value, predicate, is_extension=False, vocab=None):
def __init__(self, i, attr, value, predicate, is_extension=False, vocab=None,
regex=False, fuzzy=None, fuzzy_compare=None):
self.i = i
self.attr = attr
self.vocab = vocab
self.regex = regex
self.fuzzy = fuzzy
self.fuzzy_compare = fuzzy_compare
if self.attr == MORPH:
# normalize morph strings
self.value = set(self.vocab.morphology.add(v) for v in value)
else:
self.value = set(get_string_id(v) for v in value)
if self.regex:
self.value = set(re.compile(v) for v in value)
elif self.fuzzy is not None:
# add to string store
self.value = set(self.vocab.strings.add(v) for v in value)
else:
self.value = set(get_string_id(v) for v in value)
self.predicate = predicate
self.is_extension = is_extension
self.key = (attr, self.predicate, srsly.json_dumps(value, sort_keys=True))
self.key = (self.attr, self.regex, self.fuzzy, self.predicate, srsly.json_dumps(value, sort_keys=True))
if self.predicate not in self.operators:
raise ValueError(Errors.E126.format(good=self.operators, bad=self.predicate))

@@ -889,9 +933,29 @@ class _SetPredicate:
return False

if self.predicate == "IN":
return value in self.value
if self.regex:
value = self.vocab.strings[value]
return any(bool(v.search(value)) for v in self.value)
elif self.fuzzy is not None:
value = self.vocab.strings[value]
return any(self.fuzzy_compare(value, self.vocab.strings[v], self.fuzzy)
for v in self.value)
elif value in self.value:
return True
else:
return False
elif self.predicate == "NOT_IN":
return value not in self.value
if self.regex:
value = self.vocab.strings[value]
return not any(bool(v.search(value)) for v in self.value)
elif self.fuzzy is not None:
value = self.vocab.strings[value]
return not any(self.fuzzy_compare(value, self.vocab.strings[v], self.fuzzy)
for v in self.value)
elif value in self.value:
return False
else:
return True
elif self.predicate == "IS_SUBSET":
return value <= self.value
elif self.predicate == "IS_SUPERSET":
@@ -906,13 +970,14 @@ class _ComparisonPredicate:
class _ComparisonPredicate:
operators = ("==", "!=", ">=", "<=", ">", "<")

def __init__(self, i, attr, value, predicate, is_extension=False, vocab=None):
def __init__(self, i, attr, value, predicate, is_extension=False, vocab=None,
regex=False, fuzzy=None, fuzzy_compare=None):
self.i = i
self.attr = attr
self.value = value
self.predicate = predicate
self.is_extension = is_extension
self.key = (attr, self.predicate, srsly.json_dumps(value, sort_keys=True))
self.key = (self.attr, self.predicate, srsly.json_dumps(value, sort_keys=True))
if self.predicate not in self.operators:
raise ValueError(Errors.E126.format(good=self.operators, bad=self.predicate))

@@ -935,7 +1000,7 @@ class _ComparisonPredicate:
return value < self.value


def _get_extra_predicates(spec, extra_predicates, vocab):
def _get_extra_predicates(spec, extra_predicates, vocab, fuzzy_compare):
predicate_types = {
"REGEX": _RegexPredicate,
"IN": _SetPredicate,
@@ -949,6 +1014,16 @@ def _get_extra_predicates(spec, extra_predicates, vocab):
"<=": _ComparisonPredicate,
">": _ComparisonPredicate,
"<": _ComparisonPredicate,
"FUZZY": _FuzzyPredicate,
"FUZZY1": _FuzzyPredicate,
"FUZZY2": _FuzzyPredicate,
"FUZZY3": _FuzzyPredicate,
"FUZZY4": _FuzzyPredicate,
"FUZZY5": _FuzzyPredicate,
"FUZZY6": _FuzzyPredicate,
"FUZZY7": _FuzzyPredicate,
"FUZZY8": _FuzzyPredicate,
"FUZZY9": _FuzzyPredicate,
}
seen_predicates = {pred.key: pred.i for pred in extra_predicates}
output = []
@@ -966,22 +1041,47 @@ def _get_extra_predicates(spec, extra_predicates, vocab):
attr = "ORTH"
attr = IDS.get(attr.upper())
if isinstance(value, dict):
processed = False
value_with_upper_keys = {k.upper(): v for k, v in value.items()}
for type_, cls in predicate_types.items():
if type_ in value_with_upper_keys:
predicate = cls(len(extra_predicates), attr, value_with_upper_keys[type_], type_, vocab=vocab)
# Don't create a redundant predicates.
# This helps with efficiency, as we're caching the results.
if predicate.key in seen_predicates:
output.append(seen_predicates[predicate.key])
else:
extra_predicates.append(predicate)
output.append(predicate.i)
seen_predicates[predicate.key] = predicate.i
processed = True
if not processed:
warnings.warn(Warnings.W035.format(pattern=value))
output.extend(_get_extra_predicates_dict(attr, value, vocab, predicate_types,
extra_predicates, seen_predicates, fuzzy_compare=fuzzy_compare))
return output


def _get_extra_predicates_dict(attr, value_dict, vocab, predicate_types,
extra_predicates, seen_predicates, regex=False, fuzzy=None, fuzzy_compare=None):
output = []
for type_, value in value_dict.items():
type_ = type_.upper()
cls = predicate_types.get(type_)
if cls is None:
warnings.warn(Warnings.W035.format(pattern=value_dict))
# ignore unrecognized predicate type
continue
elif cls == _RegexPredicate:
if isinstance(value, dict):
# add predicates inside regex operator
output.extend(_get_extra_predicates_dict(attr, value, vocab, predicate_types,
extra_predicates, seen_predicates,
regex=True))
continue
elif cls == _FuzzyPredicate:
if isinstance(value, dict):
# add predicates inside fuzzy operator
fuzz = type_[len("FUZZY"):] # number after prefix
fuzzy_val = int(fuzz) if fuzz else -1
output.extend(_get_extra_predicates_dict(attr, value, vocab, predicate_types,
extra_predicates, seen_predicates,
fuzzy=fuzzy_val, fuzzy_compare=fuzzy_compare))
continue
predicate = cls(len(extra_predicates), attr, value, type_, vocab=vocab,
regex=regex, fuzzy=fuzzy, fuzzy_compare=fuzzy_compare)
# Don't create redundant predicates.
# This helps with efficiency, as we're caching the results.
if predicate.key in seen_predicates:
output.append(seen_predicates[predicate.key])
else:
extra_predicates.append(predicate)
output.append(predicate.i)
seen_predicates[predicate.key] = predicate.i
return output
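The recursion in `_get_extra_predicates_dict` is what makes nesting like `{"FUZZY": {"IN": [...]}}` work: a FUZZY-family key whose value is itself a dict re-enters the function with the fuzzy level set, so the inner `IN`/`NOT_IN` predicate is built fuzzy-aware. A simplified, illustrative flattening of that traversal (FUZZY-family only for brevity, omitting the analogous REGEX branch; the triples are stand-ins for the real predicate objects):

```python
def flatten_predicates(value_dict: dict, fuzzy=None) -> list:
    """Flatten nested predicate dicts into (predicate_type, fuzzy_level, value) triples."""
    out = []
    for type_, value in value_dict.items():
        type_ = type_.upper()
        if type_.startswith("FUZZY") and isinstance(value, dict):
            # Recurse with the fuzzy level taken from the operator suffix
            # ("FUZZY" -> -1, "FUZZY1".."FUZZY9" -> explicit level).
            suffix = type_[len("FUZZY"):]
            out.extend(flatten_predicates(value, fuzzy=int(suffix) if suffix else -1))
        else:
            out.append((type_, fuzzy, value))
    return out
```

For example, `{"FUZZY2": {"NOT_IN": ["x"]}}` flattens to a single `NOT_IN` predicate carrying fuzzy level 2, matching how the diff wires `fuzzy` through to `_SetPredicate`.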

