
add support for romanisation #267

Open
maxbachmann opened this issue Sep 22, 2022 · 4 comments
Labels
discussion (Up to discussion) · enhancement (New feature or request) · help wanted (Extra attention is needed)

Comments

@maxbachmann (Member) commented Sep 22, 2022

As described in #7, metrics like the Levenshtein distance only make much sense for languages like Chinese if there is support for romanisation.

@mrtolkien @lingvisa I opened this new issue to track support for romanisation. Note that:

  1. I am unsure how this would best be implemented to help the largest number of people
  2. I do not think I will have time to implement this myself soon, but someone else might want to pick this up. Especially now that there is a Python-only mode, it would be enough to implement this in pure Python for now (I can port it to C++ for better performance)

This should be implemented as a separate preprocessing function alongside the current default_process method.
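
A minimal sketch of how such a function could plug into RapidFuzz's existing processor hook; the romanise function here is only a hypothetical placeholder:

from rapidfuzz import fuzz, utils

def romanise(text: str) -> str:
    # Hypothetical placeholder: a real implementation would transcribe
    # `text` (e.g. Chinese -> pinyin) before the usual normalisation.
    return utils.default_process(text)

# Any str -> str callable can be passed wherever default_process is accepted.
score = fuzz.ratio("北京", "Beijing", processor=romanise)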

@maxbachmann added the enhancement, help wanted, and discussion labels on Sep 22, 2022
@maxbachmann (Member, Author) commented Sep 22, 2022

Note that I am unsure how simple or hard romanisation is depending on the language, since I have zero experience with languages that need this sort of preprocessing. So any solution making it into RapidFuzz would need to:

  1. be simple enough for even me to maintain
  2. not generate tons of issues due to suboptimal romanisation in some cases (which, depending on the language, are probably going to occur)

Depending on the amount of work this requires, it might make sense to make this a separate project. Romanisation is really not an integral step of the matching but a preprocessing step, which is likely helpful to users in and of itself (some projects for this probably already exist).
Note that I have a C-API for preprocessing functions, which would even allow you to achieve this without any performance loss compared to a built-in implementation.

I would be happy to mention these solutions in my documentation to help users coming from a language benefiting from romanisation.

@shirakaba commented Jul 10, 2023

> Depending on the amount of work this requires, it might make sense to make this a separate project.

This feels out-of-scope for RapidFuzz, because transcribing non-Roman languages is a totally separate problem space. I think users should just do it separately and pass the inputs to RapidFuzz, because then they will have complete freedom of implementation – there are many ways to transcribe, each with different tradeoffs, and none are perfect.

I'll give an example for Japanese, but a similar approach could be taken for Chinese.

Getting the pronunciation of Japanese text

Getting the phonetic transcriptions for Japanese is a straightforward process, but you'll need some pretty heavy dependencies for it.

Installation

  • fugashi is a morphological analyser for Japanese. It's just a Python wrapper around MeCab.
  • unidic is a very large (770 MB) dictionary file that provides MeCab with the token data needed to segment Japanese text.
pip install fugashi
pip install unidic

# Warning: the download for UniDic is around 770 MB!
python -m unidic download

Usage

from fugashi import GenericTagger
import unidic

# Point MeCab at the full UniDic dictionary downloaded above.
tagger = GenericTagger('-d "{}"'.format(unidic.DICDIR))


def get_pronunciation(text, tagger):
    acc = ""
    # In UniDic's feature CSV, field 9 is the pronunciation (katakana).
    pron_index = 9
    for word in tagger(text):
        pron = (
            word.feature[pron_index]
            if len(word.feature) > pron_index
            else word.surface
        )
        # UniDic marks missing fields with "*"; fall back to the surface form.
        if pron == "*":
            pron = word.surface
        acc = acc + pron
    return acc


print(get_pronunciation("東京に住む。", tagger))
# "トーキョーニスム。"

From there, you'd need a separate library to map the phonetic (katakana) characters to Roman characters – but actually just getting them as far as phonetic characters could be enough for your purposes.
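
For example, a minimal sketch (reusing get_pronunciation and tagger from above) of matching on pronunciations with RapidFuzz:

from rapidfuzz import fuzz

# Both inputs should reduce to the same katakana pronunciation,
# so the score should be high even though the written scripts differ.
a = get_pronunciation("東京に住む。", tagger)
b = get_pronunciation("とうきょうに住む。", tagger)
print(fuzz.ratio(a, b))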

@jpenney commented Jul 28, 2023

For Japanese, cutlet runs on top of fugashi and could probably be used in a preprocessing function. It's a bit heavy, needing unidic or unidic-lite, but maybe an example in the documentation would be enough?
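
For instance, an untested sketch of wiring cutlet's romaji method in as a RapidFuzz processor:

import cutlet
from rapidfuzz import fuzz

katsu = cutlet.Cutlet()  # needs unidic or unidic-lite installed

# cutlet transcribes Japanese text to romaji, so two spellings of the
# same word should end up close after preprocessing.
score = fuzz.ratio("東京に住む", "とうきょうにすむ", processor=katsu.romaji)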

@maxbachmann (Member, Author) commented:

I think a documentation section on romanisation options for different languages would make sense. It is a fairly common thing people run into when matching non-Roman languages, so having some documentation for this would be useful.
