Extend process module #188
The API will follow this scheme:

```
class Compare:
    __init__(scorer, *)

    def set_seq1(a)        # -> single
    def set_seq1_list(a)   # -> many
    def set_seq2(b)        # -> single
    def set_seq2_list(b)   # -> many

    def all()
    def max(axis=None, initial=..., keepdims=False)
    def sorted(axis=None, keepdims=False, limit=None)
```
How about calling it extractMany? I would love to have this feature soon :)
Is there any chance it'll be in one of the upcoming releases, or are you not yet interested in implementing this? Thank you :)
I am still interested in adding this feature. However, I do not expect that I will have the time to implement this until later this year.
The options I am considering so far:

I do not have a lot of experience with the whole scientific Python stack. Is there a specific format which would make sense for the results, to simplify further processing? @Unc3nZureD, do you have any preference for the result format?
Sorry for the huge delay. Sadly I have no preference, as I'm not familiar with them at all. What would currently help me is moving away from the case where I have, say, 100 patterns and a "database" of patterns, say with 1000 entries inside. At the moment I'm looping over the 100 patterns on a single Python thread and checking the DB each time. It would be a lot more optimal if some multithreaded C/C++ code could do it and just return an array of bool or float (depending on whether I need the actual value or not). Honestly speaking, I'm not even sure why scientific libs would be necessary, but that may be because I'm a beginner. Thank you!
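The loop described above (a set of patterns scored one at a time against a larger set of choices, returning a boolean array) can be sketched in plain Python. `difflib.SequenceMatcher` is used here only as a stand-in scorer since the comment doesn't name one, and `match_many` is a hypothetical helper, not part of the library:

```python
from difflib import SequenceMatcher

def score(a, b):
    # Stand-in similarity scorer on a 0..100 scale.
    return SequenceMatcher(None, a, b).ratio() * 100

def match_many(queries, choices, score_cutoff=70.0):
    # One-at-a-time loop over queries, as described in the comment above.
    # A batched many x many API could parallelize this internally.
    results = []
    for q in queries:
        results.append([score(q, c) >= score_cutoff for c in choices])
    return results

queries = ["apple", "banana"]
choices = ["apples", "bandana", "cherry"]
matrix = match_many(queries, choices)
```

Each row of `matrix` is the boolean "is this choice a match?" array the comment asks for; swapping the comparison for the raw score would yield the float variant.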
Some things about this could be simplified:

```
class Compare:
    __init__(choices, *, scorer, processor=None, scorer_kwargs=None)

    def all(queries)
    def max(queries, axis=None, initial=..., keepdims=False)
    def sorted(queries, axis=None, keepdims=False, limit=None)
```
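As an illustration only (not the actual implementation), a minimal pure-Python mock of this simplified API could look like the following. The scorer defaults to a `difflib`-based ratio as a placeholder, and `axis`/`keepdims` handling is omitted for brevity:

```python
from difflib import SequenceMatcher

class Compare:
    """Illustrative sketch of the proposed API; not the real implementation."""

    def __init__(self, choices, *, scorer=None, processor=None, scorer_kwargs=None):
        self.choices = list(choices)
        # Placeholder default scorer on a 0..100 scale.
        self.scorer = scorer or (lambda a, b: SequenceMatcher(None, a, b).ratio() * 100)
        self.processor = processor
        self.scorer_kwargs = scorer_kwargs or {}

    def _prep(self, s):
        return self.processor(s) if self.processor else s

    def all(self, queries):
        # Full score matrix: one row per query, one column per choice.
        return [[self.scorer(self._prep(q), self._prep(c), **self.scorer_kwargs)
                 for c in self.choices] for q in queries]

    def max(self, queries, axis=None, initial=None):
        # Best score per query (axis handling omitted in this sketch).
        rows = self.all(queries)
        return [max(row) if initial is None else max([initial, *row]) for row in rows]

    def sorted(self, queries, limit=None):
        # (choice index, score) pairs per query, best first.
        rows = self.all(queries)
        return [sorted(enumerate(row), key=lambda t: t[1], reverse=True)[:limit]
                for row in rows]

cmp = Compare(["apples", "bandana", "cherry"])
best = cmp.max(["apple"])
```

Splitting `choices` (set once, in the constructor) from `queries` (passed per call) matches the proposal and avoids the `set_seq1`/`set_seq2` state machine of the earlier sketch.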
I think multi-string inputs (i.e. many-to-many matches) would be a great addition. Finding the best N matching strings for a single input string can already be done via `extract`. However, allowing single-call extractions for multi-string inputs via the proposed API would be very convenient.

The use case I can imagine (and currently have) is to identify, based on customer names, whether there are relations between a set of customers, and which other customers any given input customer may be related to.

Side question: Aren't
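For the customer-relations use case, a naive many x many self-match could be sketched like this. `related` is a hypothetical helper, `difflib` stands in for a real scorer, and the cutoff value is arbitrary:

```python
from difflib import SequenceMatcher

def related(names, cutoff=70.0):
    # Compare every name against every other name (many x many self-match),
    # skipping the trivial identical-index pair and symmetric duplicates.
    sim = lambda a, b: SequenceMatcher(None, a, b).ratio() * 100
    pairs = []
    for i, a in enumerate(names):
        for j, b in enumerate(names):
            if i < j and sim(a, b) >= cutoff:
                pairs.append((i, j))
    return pairs

names = ["ACME Corp", "ACME Corporation", "Globex Inc"]
pairs = related(names)
```

A dedicated many x many API could do the same work without materializing the full score matrix, which is exactly the memory concern raised in the issue body.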
Question about

However, in the source code (https://github.com/rapidfuzz/RapidFuzz/blob/main/src/rapidfuzz/process_py.py),
Yes, I absolutely see why this class of APIs would be useful. While, as you described, it's possible to either call

Historically

This is the Python fallback implementation, where it doesn't really make that much of a difference and which is going to be significantly slower anyway. The C++ implementation in https://github.com/rapidfuzz/RapidFuzz/blob/main/src/rapidfuzz/process_cpp_impl.pyx doesn't do this. The performance difference between
Turns out, the cause was mainly the best-score short-circuit (which I hadn't known about), since I had the string itself at the beginning of

That being said, for matching 161k strings among themselves by repeatedly calling

On the other hand, a self-written argmax-based wrapper around
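For reference, an argmax-based wrapper of the kind mentioned above (presumably over a dense `cdist`-style score matrix) could be sketched as follows. This is pure Python with a `difflib` stand-in scorer, and `extract_one_many` is a hypothetical name:

```python
from difflib import SequenceMatcher

def score_matrix(queries, choices, scorer=None):
    # Dense cdist-style score matrix (pure-Python stand-in).
    scorer = scorer or (lambda a, b: SequenceMatcher(None, a, b).ratio() * 100)
    return [[scorer(q, c) for c in choices] for q in queries]

def extract_one_many(queries, choices):
    # Argmax per row: (best choice index, best score) for every query,
    # i.e. an extractOne-style result computed from the full matrix.
    matrix = score_matrix(queries, choices)
    return [max(enumerate(row), key=lambda t: t[1]) for row in matrix]

best = extract_one_many(["apple"], ["cherry", "apples"])
```

The trade-off this thread discusses: the matrix approach does all the scoring in one batch (easy to parallelize) but cannot use a best-score short-circuit and needs memory proportional to `len(queries) * len(choices)`.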
@maxbachmann: I installed rapidfuzz 3.9.7 today and later switched back to 3.9.3. Unfortunately I can't seem to reproduce the timings from yesterday (which were also on 3.9.3), for whatever reason. However, I created a new timing script, which (apart from loading the strings) is shown below. There are two test cases:

In both test cases, the performance of

Code:

Terminal output:

Test case A: The timings of

Test case B: Here extractOne() and extract(..., limit=1) take much longer than in test case A, being much closer to

Summary: In test case A,
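The original timing script is not preserved here, but a minimal stdlib-only harness in the same spirit might look like the following (synthetic data and a `difflib` stand-in scorer, not the script from the comment):

```python
import timeit
from difflib import SequenceMatcher

def ratio(a, b):
    # Stand-in scorer; the real benchmark used rapidfuzz scorers.
    return SequenceMatcher(None, a, b).ratio()

# Synthetic choices; the real benchmark used 161k loaded strings.
choices = [f"customer-{i:05d}" for i in range(2000)]
query = "customer-01234"

def extract_one():
    # Scan all choices, keep the single best match (extractOne-style).
    return max(choices, key=lambda c: ratio(query, c))

elapsed = timeit.timeit(extract_one, number=3)
print(f"extractOne-style scan over {len(choices)} choices: {elapsed:.3f}s total")
```

Timing both an `extractOne`-style scan and a full-matrix variant on the same data would reproduce the A/B comparison described above.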
Currently the process module has the following functions:

It would be nice to have equivalents of `extractOne`/`extract` for many x many. They would need less memory than `cdist`, which can take a large amount of memory when `len(queries)` and `len(choices)` are large.

A first thought might be to overload the existing `extractOne`/`extract` on the type passed as `query`/`queries`. However this is not possible, since the following is a valid usage of these methods:

which can not be distinguished from many x many. For this reason these functions need a new API.

Beside this, in many cases users are not actually interested in the exact scores, but only care about finding elements with a score better than the `score_cutoff`. These could potentially be implemented more efficiently, since the implementation could quit once it is known that they are better than `score_cutoff`. These could be the cases:

This could be automatically done when the user passes `dtype=bool`.

Any suggestions on the naming of these new APIs are welcome.