
add support for romanisation #267

Open
maxbachmann opened this issue Sep 22, 2022 · 4 comments
Labels
discussion (Up to discussion) · enhancement (New feature or request) · help wanted (Extra attention is needed)

Comments

@maxbachmann (Member) commented Sep 22, 2022

As described in #7, metrics like the Levenshtein distance only make much sense for languages like Chinese if there is support for romanisation.

@mrtolkien @lingvisa I opened this new issue to track support for romanisation. Note that:

  1. I am unsure how this would best be implemented to help the largest number of people
  2. I do not think I will have time to implement this myself soon, but someone else might want to pick this up. Especially now that there is a Python-only mode, it would be enough to implement this in pure Python for now (I can port it to C++ for better performance)

This should be implemented as a separate preprocessing function alongside the current default_process method.
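
A minimal sketch of how such a function could plug into RapidFuzz's existing processor hook; the romanise function here is only a hypothetical placeholder:

from rapidfuzz import fuzz, utils

def romanise(text: str) -> str:
    # Hypothetical placeholder: a real implementation would transcribe
    # `text` (e.g. Chinese -> pinyin) before the usual normalisation.
    return utils.default_process(text)

# Any str -> str callable can be passed wherever default_process is accepted.
score = fuzz.ratio("北京", "Beijing", processor=romanise)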

@maxbachmann added the enhancement, help wanted, and discussion labels on Sep 22, 2022
@maxbachmann (Member, Author) commented Sep 22, 2022

Note that I am unsure how simple or hard romanisation is depending on the language, since I have zero experience with languages that need this sort of preprocessing. So any solution making it into RapidFuzz would need to:

  1. be simple enough for even me to maintain
  2. not generate tons of issues due to suboptimal romanisation in some cases (which, depending on the language, are probably going to occur)

Depending on the amount of work this requires, it might make sense to make this a separate project. Romanisation is really not an integral step of the matching but a preprocessing step, which is likely helpful to users in and of itself (some projects for this probably already exist).
Note that I have a C-API for preprocessing functions, which would even allow you to achieve this without any performance loss compared to a built-in implementation.

I would be happy to mention these solutions in my documentation to help users coming from a language benefiting from romanisation.

@shirakaba commented Jul 10, 2023

> Depending on the amount of work this requires, it might make sense to make this a separate project.

This feels out-of-scope for RapidFuzz, because transcribing non-Roman languages is a totally separate problem space. I think users should just do it separately and pass the inputs to RapidFuzz, because then they will have complete freedom of implementation – there are many ways to transcribe, each with different tradeoffs, and none are perfect.

I'll give an example for Japanese, but a similar approach could be taken for Chinese.

Getting the pronunciation of Japanese text

Getting the phonetic transcriptions for Japanese is a straightforward process, but you'll need some pretty heavy dependencies for it.

Installation

  • fugashi is a morphological analyser for Japanese. It's just a Python wrapper around MeCab.
  • unidic is a very large (770 MB) dictionary file that provides MeCab with the token data needed to segment Japanese text.
pip install fugashi
pip install unidic

# Warning: the download for UniDic is around 770 MB!
python -m unidic download

Usage

from fugashi import GenericTagger
import unidic

# Point MeCab at the full UniDic dictionary downloaded above.
tagger = GenericTagger('-d "{}"'.format(unidic.DICDIR))


def get_pronunciation(text, tagger):
    acc = ""
    # In UniDic's feature CSV, field 9 is the pronunciation (katakana).
    pron_index = 9
    for word in tagger(text):
        pron = (
            word.feature[pron_index]
            if len(word.feature) > pron_index
            else word.surface
        )
        # UniDic marks missing fields with "*"; fall back to the surface form.
        if pron == "*":
            pron = word.surface
        acc = acc + pron
    return acc


print(get_pronunciation("東京に住む。", tagger))
# "トーキョーニスム。"

From there, you'd need a separate library to map the phonetic (katakana) characters to Roman characters – but actually just getting them as far as phonetic characters could be enough for your purposes.
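
For example, a minimal sketch (reusing get_pronunciation and tagger from above) of matching on pronunciations with RapidFuzz:

from rapidfuzz import fuzz

# Both inputs should reduce to the same katakana pronunciation,
# so the score should be high even though the written scripts differ.
a = get_pronunciation("東京に住む。", tagger)
b = get_pronunciation("とうきょうに住む。", tagger)
print(fuzz.ratio(a, b))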

@jpenney commented Jul 28, 2023

For Japanese, cutlet runs on top of fugashi and could probably be used in a preprocessing function. It's a bit heavy, needing unidic or unidic-lite, but maybe an example in the documentation would be enough?
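
For instance, an untested sketch of wiring cutlet's romaji method in as a RapidFuzz processor:

import cutlet
from rapidfuzz import fuzz

katsu = cutlet.Cutlet()  # needs unidic or unidic-lite installed

# cutlet transcribes Japanese text to romaji, so two spellings of the
# same word should end up close after preprocessing.
score = fuzz.ratio("東京に住む", "とうきょうにすむ", processor=katsu.romaji)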

@maxbachmann (Member, Author) commented:

I think a documentation section on romanisation options for different languages would make sense. It is a fairly common thing people run into when matching non-Roman languages, so having some documentation for this would be useful.
