
Asian languages usage #7

Closed
T1-Tolki opened this issue Mar 27, 2020 · 12 comments
Labels
enhancement New feature or request

Comments

@T1-Tolki

Hello,

Not an issue per se, but I was curious about possible Asian language compatibility. Chinese, Korean, and Japanese all perform very poorly with Levenshtein distance matching because they're not alphabet-based.

A solution would be to use a romanisation library to convert them to alphabetic characters, in the same way that all characters are put in lowercase.

I'd love to give a hand, but unfortunately my cpp is pretty rusty and I'd do more harm than good.

@maxbachmann maxbachmann added the enhancement New feature or request label Mar 27, 2020
@maxbachmann
Member

maxbachmann commented Mar 27, 2020

Hm, I suppose this would have to be done on a per-language basis. So it makes the most sense to add some language-based preprocessors for the sentences. E.g. when using process.extract, the processor argument can be any Callable, which will then be used instead of the default. So you are not limited to the processors this library provides, and they do not have to be implemented in C++ either (e.g. the current default processor, which strips and lowercases, is implemented both in Python and in C++, since calling the C++ processor from Python has some overhead that makes it slower).

Does it make sense to provide these processors with the library, or would it make more sense to provide a better guide for the processor argument, possibly even with links to good libraries for specific languages, so people can simply install such an extra library and pass it as a processor into the function?
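
As a minimal sketch of that idea (assuming the unidecode package as the romanisation step; the exact keyword arguments may differ slightly between RapidFuzz versions):

    # sketch: plug a romanisation step in as the processor argument
    from unidecode import unidecode
    from rapidfuzz import process, fuzz, utils

    def romanised_process(sentence):
        # transliterate to ASCII first, then apply the default strip/lowercase step
        return utils.default_process(unidecode(sentence))

    process.extractOne("São Paulo", ["sao paulo", "santos"],
                       scorer=fuzz.WRatio, processor=romanised_process)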

@T1-Tolki
Author

I think it could make sense to provide those options as [extras] for example, which could then even try to guess the language by looking at the first non-alphabetical character.

I will take a look at it myself, as it might be too specific to my own use cases.

@maxbachmann
Member

Hm, I am absolutely fine with placing this in the library so the user can simply select e.g. a Korean preprocessor. Since I personally do not have a clue about these languages, I think this is probably best done by someone who knows something about them.

This could simply be another function in here: https://github.com/rhasspy/rapidfuzz/blob/master/python/src/rapidfuzz/utils.py

So when you implement something like this for a language, I would love to add it to the library ;)
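
As a rough sketch, such a function could follow the same strip-and-lowercase pattern as the existing default processor (romanize here is a hypothetical placeholder for whatever language-specific library gets chosen):

    # hypothetical sketch of a language-specific preprocessor that could live in utils.py
    def default_process_ko(sentence, romanize):
        # romanize is a placeholder for a real Korean romanisation callable;
        # after romanising, apply the same strip/lowercase step as the default processor
        return romanize(sentence).strip().lower()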

@mrtolkien

mrtolkien commented Mar 28, 2020

Currently, the Python processor is only called when not using any part of the C++ code.

Would it be possible for the C++ routine to actually return a list index + score instead of a string + score, so processors could be applied before passing the variables to the C++ extractOne function? I don't think the overhead would be significant, and it would allow for much more customisation.

See this branch:
https://github.com/mrtolkien/rapidfuzz/tree/romanisation

@maxbachmann
Member

maxbachmann commented Mar 28, 2020

That's already done.
Here is a commented version of the current extractOne.

When there is no scorer or the scorer is the default scorer, and there is no processor or it is the default processor, it directly calls the C++ extractOne as an optimisation for the default case, since there is some overhead involved in calling C++ from Python:

    if (not scorer or scorer == fuzz.WRatio) and (not processor or processor == utils.default_process):
        return _rapidfuzz_cpp.process.extractOne(query, list(choices), score_cutoff, bool(processor))

When it is not the default case, extractOne is simply implemented in Python -> basically a copy-paste of the C++ code in Python, since passing the callables to C++ and then calling them from there would mean
Python -> C++ -> (Python -> C++) loop

while implementing it in Python it is
(Python -> C++) loop
which saves a little bit of overhead:

    # evaluate the score inside Python, since the scorer is a Python function and it would
    # otherwise be required to call back into the Python layer from C++ as well
    a = processor(query) if processor else query
    match_found = False
    result_choice = ""

    for choice in choices:
        b = processor(choice) if processor else choice

        score = scorer(a, b, score_cutoff, False)
        if score >= score_cutoff:
            score_cutoff = score
            match_found = True
            result_choice = choice

    return (result_choice, score_cutoff) if match_found else None

So when you pass anything non-default it is already implemented in Python, and as you describe, while there is an overhead it is not super significant.
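
For example, a call like the following (using fuzz.ratio as a non-default scorer and str.lower as a toy processor) takes this Python path:

    # hits the pure-Python fallback because neither scorer nor processor is the default
    from rapidfuzz import process, fuzz

    process.extractOne("new york", ["new york city", "newark"],
                       scorer=fuzz.ratio, processor=str.lower)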

To give some context, here is the result of a test with 1,000,000 elements (the first one uses a query that is a good match and is therefore very optimised, and it does no preprocessing, since in many cases that's something you could do once and then reuse the cleaned-up list of strings multiple times afterwards):

from timeit import timeit

print(timeit(
"""
process.extractOne(query, choices, scorer=fuzz.WRatio, processor=None)
""",
setup=
"""
from rapidfuzz import process
from rapidfuzz import fuzz
query = "please add bananas to my shopping list"
choices = ["can you add bananas to my shopping list please" for _ in range(1000000)]
""",
number=1
))

print(timeit(
"""
process.extractOne(query, choices)
""",
setup=
"""
from rapidfuzz import process
query = "please add bananas to my shopping list"
choices = ["whats the weather like in Paris" for _ in range(1000000)]
""",
number=1
))

direct call to the extractOne in C++

0.3276703062001616
5.064501917921007

Python Implementation

0.46753298211842775
5.5052065760828555

Btw the same with fuzzywuzzy[speedup] (so the faster version of fuzzywuzzy)

31.484403903596103
28.21531048975885

This is why I decided to go with the faster option for the default case, while still allowing the customisation you describe.

@mrtolkien

mrtolkien commented Mar 28, 2020

I see. In my fork, I am using a list comprehension to apply the processor automatically to all the possible choices, then passing this processed list to the C++ function. In the absence of a custom scorer function, wouldn't that be faster?

    if (not scorer or scorer == fuzz.WRatio) and (not processor or processor == utils.default_process):
        return _rapidfuzz_cpp.process.extractOne(processor(query),
                                                 [processor(c) for c in choices],
                                                 score_cutoff,
                                                 False)

If this returned a list index along with the ratio, we would be able to keep the current functionality while still using C++ for the Levenshtein distance.
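
A sketch of that idea from the caller's side (assuming extractOne keeps returning the matched string as the first element of its result, the original choice can be recovered by index from the processed list):

    # sketch: preprocess the choices once, match on the processed strings,
    # then map the result back to the original choice by index
    from rapidfuzz import process, fuzz
    from unidecode import unidecode

    choices = ["São Paulo", "Santos", "Rio de Janeiro"]
    processed = [unidecode(c).lower() for c in choices]

    result = process.extractOne("sao paulo", processed, scorer=fuzz.WRatio, processor=None)
    if result is not None:
        original = choices[processed.index(result[0])]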

@maxbachmann
Member

Let me try, and as for your question in the commit message, use:

    if (not scorer or scorer == fuzz.WRatio) and (not processor or processor == utils.default_process):
        return _rapidfuzz_cpp.process.extractOne(processor(query),
                                                 [processor(c) for c in choices],
                                                 score_cutoff,
                                                 False)

@maxbachmann
Member

maxbachmann commented Mar 28, 2020

I did test unidecode, and I am not sure whether this should actually be the default behaviour, for two reasons:

  1. Performance:
    On my 1,000,000 elements, even when nothing has to be done, this costs me about 2 seconds extra (that's between 1.5 times and 7 times the previous runtime!).
    When you need to do this, I think paying this price is absolutely fine, and it is still a lot faster than fuzzywuzzy; however, not everyone needs it, e.g. for English. So I would prefer this to be a second processor that users can pass when they want this behaviour. (This deserves an explanation of how to use it in the README, however.)

  2. While in Chinese these results might make sense, this is not necessarily always the case. As an example, in French there are the words « sale » (dirty) and « salé » (salty). So this would make salty equal to dirty when using unihandecode. I think that can be too aggressive a replacement for the default behaviour.

Since Unihandecoder defaults to Chinese, I think it would be good to allow providing the language as well.
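
For example, the opt-in variant could look roughly like this (a sketch, assuming Unihandecode exposes a Unihandecoder(lang=...) object with a decode() method; the exact API may differ):

    # sketch: opt-in transliteration processor with an explicit language
    from unihandecode import Unihandecoder
    from rapidfuzz import process, utils

    decoder = Unihandecoder(lang="ja")  # choose the language instead of the Chinese default

    def ja_process(sentence):
        return utils.default_process(decoder.decode(sentence))

    query = "とうきょうだいがく"
    choices = ["東京大学", "京都大学", "大阪大学"]
    process.extractOne(query, choices, processor=ja_process)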

@lingvisa

Is there a solution for Asian languages, including Chinese, for this issue? I also want to use RapidFuzz for Chinese.

@maxbachmann
Member

The library supports matching on any kind of Unicode characters, which should include Chinese. Even the preprocess function supports Unicode now.

@lingvisa

But the quality seems very bad and I will test more.

@mrtolkien

@maxbachmann this is not relevant for those languages, as the Levenshtein distance doesn't make sense on the characters directly. There needs to be some kind of romanisation for it to make sense.
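
As a sketch of what such a romanisation step could look like for Chinese (assuming the pypinyin package and its lazy_pinyin() function; this is not part of RapidFuzz):

    # sketch: romanise Chinese to pinyin before fuzzy matching
    from pypinyin import lazy_pinyin
    from rapidfuzz import process, utils

    def zh_process(sentence):
        # lazy_pinyin splits Chinese text into pinyin syllables without tone marks
        return utils.default_process(" ".join(lazy_pinyin(sentence)))

    choices = ["北京大学", "清华大学", "上海交通大学"]
    process.extractOne("北京的大学", choices, processor=zh_process)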
