Asian languages usage #7
Hm, I suppose this would have to be done on a per-language basis. So it makes most sense to add some language-based preprocessors for the sentences. Does it make sense to provide these processors with the library, or would it make more sense to provide a better guide for the processor, possibly even with links to good libraries for specific languages, so people can simply install this extra library and pass it as a processor into the function?
I think it could make sense to provide those options as [extras], for example, which could then even try to guess the language by looking at the first non-alphabetical character. I will take a look at it myself, as it might be too related to my own use cases.
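A guess-the-language heuristic along those lines could be sketched by inspecting Unicode code-point ranges. The function name and the set of scripts below are illustrative only, not part of rapidfuzz:

```python
def guess_script(text):
    """Toy heuristic: classify a string by the script of its first
    CJK/Hangul character (illustrative only)."""
    for ch in text:
        cp = ord(ch)
        if 0x4E00 <= cp <= 0x9FFF:    # CJK Unified Ideographs
            return "han"
        if 0x3040 <= cp <= 0x30FF:    # Hiragana + Katakana
            return "japanese_kana"
        if 0xAC00 <= cp <= 0xD7A3:    # Hangul syllables
            return "hangul"
    return "latin"

print(guess_script("hello"))       # latin
print(guess_script("こんにちは"))   # japanese_kana
print(guess_script("안녕하세요"))   # hangul
```

A real implementation would need to handle mixed-script strings and many more ranges, but the lookup itself is cheap enough to run per sentence.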
Hm, I am absolutely fine with placing this in the library so the user can simply select e.g. a Korean preprocessor. Since I personally do not have a clue about these languages, I think this is probably best done by someone who knows something about them. This could simply be another function in here: https://github.com/rhasspy/rapidfuzz/blob/master/python/src/rapidfuzz/utils.py So when you're implementing something like this for a language, I would love to add it to the library ;)
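Such a language-specific preprocessor is just a `str -> str` function, so it can be passed through the existing `processor` argument. A minimal sketch with a hand-written toy romanisation table (the table and function name are made up for illustration; a real preprocessor would use a proper transliteration library):

```python
# Toy romanisation table -- a real preprocessor would use a
# transliteration library instead of a hand-written dict.
_TOY_PINYIN = {"你": "ni", "好": "hao", "猫": "mao"}

def toy_chinese_process(sentence):
    """Romanise known characters, lowercase, and keep only
    alphanumerics and spaces, mirroring the general shape of
    utils.default_process."""
    romanised = "".join(_TOY_PINYIN.get(ch, ch) for ch in sentence)
    return "".join(
        ch for ch in romanised.lower() if ch.isalnum() or ch == " "
    ).strip()

print(toy_chinese_process("你好"))  # nihao
```

It could then be plugged in as e.g. `process.extractOne(query, choices, processor=toy_chinese_process)`.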
Currently, the Python processor is only called when not using any part of the C++ code. Would it be possible for the C++ routine to actually return a list index + score instead of a string + score, so processors could be applied before passing the variables to the C++ extractOne function? I don't think the overhead would be significant, and it would allow for much more customisation. See this branch:
That's already done. When there is no scorer or the scorer is the default scorer, and there is no processor or it is the default processor, it directly calls the C++ extractOne as an optimisation for the default case, since there is some overhead involved in calling C++ from Python:

```python
if (not scorer or scorer == fuzz.WRatio) and (not processor or processor == utils.default_process):
    return _rapidfuzz_cpp.process.extractOne(query, list(choices), score_cutoff, bool(processor))
```

When it is not the default case, extractOne is simply implemented in Python — basically a copy-paste of the C++ code in Python, since passing the Python functions to C++ and then calling them from there would mean adding the Python layer from C++ as well:

```python
# evaluate the score inside Python, since the scorer is a Python function
# and it would otherwise be required to add the Python layer from C++ as well
a = processor(query) if processor else query
match_found = False
result_choice = ""
for choice in choices:
    b = processor(choice) if processor else choice
    score = scorer(a, b, score_cutoff, False)
    if score >= score_cutoff:
        score_cutoff = score
        match_found = True
        result_choice = choice
return (result_choice, score_cutoff) if match_found else None
```

So when you pass any non-default scorer or processor, it is already implemented in Python, and as you describe, while there is an overhead it is not very significant. To give some context, here is the result of a test with 1,000,000 elements (the first one is a good match and therefore very optimised, and does no preprocessing, since in many cases that's something you could do once and then use the cleaned-up list of strings multiple times afterwards):

```python
print(timeit(
    """
process.extractOne(query, choices, scorer=fuzz.WRatio, processor=None)
    """,
    setup="""
from rapidfuzz import process
from rapidfuzz import fuzz
query = "please add bananas to my shopping list"
choices = ["can you add bananas to my shopping list please" for _ in range(1000000)]
""",
    number=1
))

print(timeit(
    """
process.extractOne(query, choices)
    """,
    setup="""
from rapidfuzz import process
query = "please add bananas to my shopping list"
choices = ["whats the weather like in Paris" for _ in range(1000000)]
""",
    number=1
))
```

Direct call to extractOne in C++:

Python implementation:

Btw, the same with fuzzywuzzy[speedup] (so the faster version of fuzzywuzzy):

This is why I decided to go with the faster option for the default case, while still allowing the customisation you describe.
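The pure-Python fallback loop above can be exercised on its own. Here is a self-contained version with a stand-in scorer based on difflib — the scorer and the function name are illustrative, not rapidfuzz's actual implementation:

```python
from difflib import SequenceMatcher

def simple_scorer(a, b, score_cutoff=0, preprocess=False):
    """Stand-in for fuzz.WRatio: similarity in the range 0..100.
    It accepts the same four positional arguments the loop passes."""
    return SequenceMatcher(None, a, b).ratio() * 100

def extract_one_py(query, choices, scorer, processor=None, score_cutoff=0):
    """Pure-Python extractOne, mirroring the fallback loop above."""
    a = processor(query) if processor else query
    match_found = False
    result_choice = ""
    for choice in choices:
        b = processor(choice) if processor else choice
        score = scorer(a, b, score_cutoff, False)
        if score >= score_cutoff:
            score_cutoff = score
            match_found = True
            result_choice = choice
    return (result_choice, score_cutoff) if match_found else None

choices = ["whats the weather like in Paris",
           "can you add bananas to my shopping list please"]
print(extract_one_py("please add bananas to my shopping list",
                     choices, simple_scorer))
```

Because each new match raises `score_cutoff`, later choices only win if they score at least as high — which is also why the loop gets faster when a good match appears early.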
I see. In my fork, I am using a list comprehension to apply the processor automatically to all the possible choices, then passing this processed list to the C++ function. In the absence of a custom scorer function, wouldn't that be faster?

```python
if (not scorer or scorer == fuzz.WRatio) and (not processor or processor == utils.default_process):
    return _rapidfuzz_cpp.process.extractOne(processor(query),
                                             [processor(c) for c in choices],
                                             score_cutoff,
                                             False)
```

If this returned a list index with the ratio, we would be able to keep the current functionality while still using C++ for the Levenshtein distance.
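The index idea can be illustrated in pure Python: preprocess the choices once, match on the processed list, and map the winning index back to the original, unprocessed choice. The helper names and the simplified default processor below are hypothetical stand-ins, not rapidfuzz's API:

```python
from difflib import SequenceMatcher

def default_process(s):
    """Simplified stand-in for utils.default_process."""
    return "".join(ch for ch in s.lower() if ch.isalnum() or ch == " ").strip()

def extract_one_index(query, choices, processor=default_process):
    """Return (best_index, score) so the caller can recover the
    original, unprocessed choice from the index."""
    q = processor(query)
    processed = [processor(c) for c in choices]  # done once, reusable
    scores = [SequenceMatcher(None, q, p).ratio() * 100 for p in processed]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]

choices = ["What's the weather like?", "Add BANANAS to my list!"]
idx, score = extract_one_index("add bananas to my list", choices)
print(choices[idx])  # the original string, not the processed one
```

Returning an index instead of the processed string is what makes this round trip possible without keeping a parallel mapping.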
Let me try. And as for your question in the commit message, use:
I did test unidecode, and I am not sure whether this should actually be the default behaviour, for two reasons:
Since Unihandecoder defaults to Chinese, I think it would be good to allow providing the language as well.
Is there a solution for Asian languages, including Chinese, for this issue? I also want to use RapidFuzz for Chinese.
The library supports matching for any kind of Unicode characters, which should include Chinese. Even the preprocess function supports Unicode now.
But the quality seems very bad, and I will test more.
@maxbachmann this is not relevant for those languages, as the Levenshtein distance doesn't make sense with the characters directly. There needs to be some kind of romanisation for it to make sense.
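To make that concrete: the two spellings of the same Japanese word below share no characters, so their raw Levenshtein distance is maximal, while a romanised comparison matches exactly. The romanisation table is a hand-written toy; a real implementation would use a transliteration library:

```python
def levenshtein(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# Toy romanisation: the kanji and kana spellings of "Tokyo"
ROMAJI = {"東京": "toukyou", "とうきょう": "toukyou"}

print(levenshtein("東京", "とうきょう"))                  # 5: no characters in common
print(levenshtein(ROMAJI["東京"], ROMAJI["とうきょう"]))  # 0: identical after romanisation
```

The raw distance (5) is the worst possible for these lengths, even though the strings spell the same word, which is exactly why a romanising preprocessor helps here.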
Hello,
Not an issue per se, but I was curious about possible Asian language compatibility. Chinese, Korean, and Japanese all perform very poorly with Levenshtein distance matching because they're not alphabet-based.
A solution would be to use a romanisation library to translate them to alphabet characters, in the same way that all characters are put in lowercase.
I'd love to give a hand, but unfortunately my C++ is pretty rusty and I'd do more harm than good.