
Asian languages usage #7

Closed
T1-Tolki opened this issue Mar 27, 2020 · 12 comments
Labels
enhancement New feature or request

Comments

@T1-Tolki

Hello,

Not an issue per se, but I was curious about possible Asian language compatibility. Chinese, Korean, and Japanese all perform very poorly with Levenshtein distance matching because they're not alphabet-based.

A solution would be to use a romanisation library to convert them to alphabetic characters, in the same way that all characters are put in lowercase.

I'd love to give a hand, but unfortunately my cpp is pretty rusty and I'd do more harm than good.

@maxbachmann maxbachmann added the enhancement New feature or request label Mar 27, 2020
@maxbachmann
Member

maxbachmann commented Mar 27, 2020

Hm, I suppose this would have to be done on a per-language basis. So it makes the most sense to add some language-based preprocessors for the sentences. E.g. when using process.extract, the processor argument can be any Callable, which will then be used instead of the default. So you are not limited to the processors this library provides, and they do not have to be implemented in C++ either (e.g. the current default processor, which strips and lowercases, is implemented both in Python and in C++, since calling the C++ processor from Python has some overhead that makes it slower).

Does it make sense to provide these processors with the library, or would it make more sense to provide a better guide for the processor argument, possibly even with links to good libraries for specific languages, so people can simply install such an extra library and pass it as a processor into the function?
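
As a minimal sketch of that idea (assuming the unidecode package as the romanisation step; the exact keyword arguments may differ slightly between RapidFuzz versions):

    # sketch: plug a romanisation step in as the processor argument
    from unidecode import unidecode
    from rapidfuzz import process, fuzz, utils

    def romanised_process(sentence):
        # transliterate to ASCII first, then apply the default strip/lowercase step
        return utils.default_process(unidecode(sentence))

    process.extractOne("São Paulo", ["sao paulo", "santos"],
                       scorer=fuzz.WRatio, processor=romanised_process)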

@T1-Tolki
Author

I think it could make sense to provide those options as [extras] for example, which could then even try to guess the language by looking at the first non-alphabetical character.

I will take a look at it myself, as it might be too specific to my own use cases.

@maxbachmann
Member

Hm, I am absolutely fine with placing this in the library so the user can simply select e.g. a Korean preprocessor. Since I personally do not have a clue about these languages, I think this is probably best done by someone who knows something about them.

This could simply be another function in here: https://github.com/rhasspy/rapidfuzz/blob/master/python/src/rapidfuzz/utils.py

So when you implement something like this for a language, I would love to add it to the library ;)
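
As a rough sketch, such a function could follow the same strip-and-lowercase pattern as the existing default processor (romanize here is a hypothetical placeholder for whatever language-specific library gets chosen):

    # hypothetical sketch of a language-specific preprocessor that could live in utils.py
    def default_process_ko(sentence, romanize):
        # romanize is a placeholder for a real Korean romanisation callable;
        # after romanising, apply the same strip/lowercase step as the default processor
        return romanize(sentence).strip().lower()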

@mrtolkien

mrtolkien commented Mar 28, 2020

Currently, the Python processor is only called when not using any part of the C++ code.

Would it be possible for the C++ routine to actually return a list index + score instead of a string + score, so processors could be applied before passing the variables to the C++ extractOne function? I don't think the overhead would be significant, and it would allow for much more customisation.

See this branch:
https://github.com/mrtolkien/rapidfuzz/tree/romanisation

@maxbachmann
Member

maxbachmann commented Mar 28, 2020

That's already done.
Here is a commented version of the current extractOne.

When there is no scorer or the scorer is the default scorer, and there is no processor or it is the default processor, it directly calls the C++ extractOne as an optimisation for the default case, since there is some overhead involved in calling C++ from Python:

    if (not scorer or scorer == fuzz.WRatio) and (not processor or processor == utils.default_process):
        return _rapidfuzz_cpp.process.extractOne(query, list(choices), score_cutoff, bool(processor))

When it is not the default case, extractOne is simply implemented in Python -> basically a copy-paste of the C++ code in Python, since passing the callables to C++ and then calling them from there would mean
Python -> C++ -> (Python -> C++) loop

while implementing it in Python it is
(Python -> C++) loop
which saves a little bit of overhead:

    # evaluate the score inside Python, since the scorer is a Python function and it would
    # otherwise be required to call back into the Python layer from C++ as well
    a = processor(query) if processor else query
    match_found = False
    result_choice = ""

    for choice in choices:
        b = processor(choice) if processor else choice

        score = scorer(a, b, score_cutoff, False)
        if score >= score_cutoff:
            score_cutoff = score
            match_found = True
            result_choice = choice

    return (result_choice, score_cutoff) if match_found else None

So when you pass anything non-default it is already implemented in Python, and as you describe, while there is an overhead it is not super significant.
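
For example, a call like the following (using fuzz.ratio as a non-default scorer and str.lower as a toy processor) takes this Python path:

    # hits the pure-Python fallback because neither scorer nor processor is the default
    from rapidfuzz import process, fuzz

    process.extractOne("new york", ["new york city", "newark"],
                       scorer=fuzz.ratio, processor=str.lower)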

To give some context, here is the result of a test with 1,000,000 elements (the first one uses a query that is a good match and is therefore very optimised, and it does no preprocessing, since in many cases that's something you could do once and then reuse the cleaned-up list of strings multiple times afterwards):

from timeit import timeit

print(timeit(
"""
process.extractOne(query, choices, scorer=fuzz.WRatio, processor=None)
""",
setup=
"""
from rapidfuzz import process
from rapidfuzz import fuzz
query = "please add bananas to my shopping list"
choices = ["can you add bananas to my shopping list please" for _ in range(1000000)]
""",
number=1
))

print(timeit(
"""
process.extractOne(query, choices)
""",
setup=
"""
from rapidfuzz import process
query = "please add bananas to my shopping list"
choices = ["whats the weather like in Paris" for _ in range(1000000)]
""",
number=1
))

direct call to the extractOne in C++

0.3276703062001616
5.064501917921007

Python Implementation

0.46753298211842775
5.5052065760828555

Btw the same with fuzzywuzzy[speedup] (so the faster version of fuzzywuzzy)

31.484403903596103
28.21531048975885

This is why I decided to go with the faster option for the default case, while still allowing the customisation you describe.

@mrtolkien

mrtolkien commented Mar 28, 2020

I see. In my fork, I am using a list comprehension to apply the processor automatically to all the possible choices, then passing this processed list to the C++ function. In the absence of a custom scorer function, wouldn't that be faster?

    if (not scorer or scorer == fuzz.WRatio) and (not processor or processor == utils.default_process):
        return _rapidfuzz_cpp.process.extractOne(processor(query),
                                                 [processor(c) for c in choices],
                                                 score_cutoff,
                                                 False)

If this returned a list index along with the ratio, we would be able to keep the current functionality while still using C++ for the Levenshtein distance.
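
A sketch of that idea from the caller's side (assuming extractOne keeps returning the matched string as the first element of its result, the original choice can be recovered by index from the processed list):

    # sketch: preprocess the choices once, match on the processed strings,
    # then map the result back to the original choice by index
    from rapidfuzz import process, fuzz
    from unidecode import unidecode

    choices = ["São Paulo", "Santos", "Rio de Janeiro"]
    processed = [unidecode(c).lower() for c in choices]

    result = process.extractOne("sao paulo", processed, scorer=fuzz.WRatio, processor=None)
    if result is not None:
        original = choices[processed.index(result[0])]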

@maxbachmann
Member

Let me try, and as for your question in the commit message, use:

    if (not scorer or scorer == fuzz.WRatio) and (not processor or processor == utils.default_process):
        return _rapidfuzz_cpp.process.extractOne(processor(query),
                                                 [processor(c) for c in choices],
                                                 score_cutoff,
                                                 False)

@maxbachmann
Member

maxbachmann commented Mar 28, 2020

I did test unidecode, and I am not sure whether this should actually be the default behaviour, for two reasons:

  1. Performance:
    On my 1,000,000 elements, even when nothing has to be done, this costs me about 2 seconds extra (that's between 1.5 times and 7 times the previous runtime!).
    When you need to do this, I think paying this price is absolutely fine, and it is still a lot faster than fuzzywuzzy; however, not everyone needs it, e.g. for English. So I would prefer this to be a second processor that users can pass when they want this behaviour. (This deserves an explanation of how to use it in the README, however.)

  2. While in Chinese these results might make sense, this is not necessarily always the case. As an example, in French there are the words « sale » (dirty) and « salé » (salty). So this would make salty equal to dirty when using unihandecode. I think that can be too aggressive a replacement for the default behaviour.

Since Unihandecoder defaults to Chinese, I think it would be good to allow providing the language as well.
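
For example, the opt-in variant could look roughly like this (a sketch, assuming Unihandecode exposes a Unihandecoder(lang=...) object with a decode() method; the exact API may differ):

    # sketch: opt-in transliteration processor with an explicit language
    from unihandecode import Unihandecoder
    from rapidfuzz import process, utils

    decoder = Unihandecoder(lang="ja")  # choose the language instead of the Chinese default

    def ja_process(sentence):
        return utils.default_process(decoder.decode(sentence))

    query = "とうきょうだいがく"
    choices = ["東京大学", "京都大学", "大阪大学"]
    process.extractOne(query, choices, processor=ja_process)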

@lingvisa

Is there a solution for Asian languages, including Chinese, for this issue? I also want to use RapidFuzz for Chinese.

@maxbachmann
Member

The library supports matching on any kind of Unicode characters, which should include Chinese. Even the preprocess function supports Unicode now.

@lingvisa

But the quality seems very bad and I will test more.

@mrtolkien

@maxbachmann this is not relevant for those languages, as the Levenshtein distance doesn't make sense on the characters directly. There needs to be some kind of romanisation for it to make sense.
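
As a sketch of what such a romanisation step could look like for Chinese (assuming the pypinyin package and its lazy_pinyin() function; this is not part of RapidFuzz):

    # sketch: romanise Chinese to pinyin before fuzzy matching
    from pypinyin import lazy_pinyin
    from rapidfuzz import process, utils

    def zh_process(sentence):
        # lazy_pinyin splits Chinese text into pinyin syllables without tone marks
        return utils.default_process(" ".join(lazy_pinyin(sentence)))

    choices = ["北京大学", "清华大学", "上海交通大学"]
    process.extractOne("北京的大学", choices, processor=zh_process)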
