Support for german Umlauts #2146

Open: c-yco wants to merge 1 commit into base: master


c-yco commented Mar 8, 2015

Instead of replacing umlauts (ö, ä, ü) with o, a, u, replace them with oe, ae and ue, as this is the usual way to transliterate them in German. For example, 'Die Ärzte' becomes 'Die Aerzte' instead of 'Die Arzte'.
This way, more results are usually found.
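
The substitution described above can be sketched in a few lines of Python. This is only an illustration, not the code in this PR: the name `transliterate_german` and the mapping table are assumptions made for the example, and the final step (folding any other accented letter to its base letter) is assumed to mirror the previous behaviour.

```python
import unicodedata

# Hypothetical mapping for illustration: German umlauts expand to vowel + "e".
GERMAN_UMLAUTS = {
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
}


def transliterate_german(text):
    """Expand German umlauts, then fold remaining diacritics to base letters."""
    # Compose first so "a" + combining diaeresis is seen as the single character "ä".
    text = unicodedata.normalize("NFC", text)
    expanded = "".join(GERMAN_UMLAUTS.get(ch, ch) for ch in text)
    # Decompose what is left and drop the combining marks (e.g. "é" -> "e").
    decomposed = unicodedata.normalize("NFKD", expanded)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(transliterate_german("Die Ärzte"))  # Die Aerzte
```

Note that the expansion is applied to every ä, ö and ü regardless of language, so 'Björk' also becomes 'Bjoerk'.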

basilfx (Contributor) commented Mar 8, 2015

I don't think this transformation applies to all languages. Can you verify?

c-yco (Author) commented Mar 8, 2015

I have done some research. It looks like these letters occur mostly in German and, to a lesser extent, in some other European languages (Swedish, for example: 'Björk'). I looked up some releases for 'bjoerk' and got some decent results. So I think that for these 3 letters it is best practice to substitute them this way. Perhaps somebody else can also advise on this?

The perfect way would be to search for both versions, i.e. do 2 search runs. But I am currently not deep enough into the code to accomplish this.

dpons039 commented

Hello,
Please keep in mind that not all languages use the same replacement. In Spanish, ü = u.

I would rather perform 2 searches, with "u" and "ue".

Something similar happens with Ñ in torrent names: the options people use are "ni", "ny" or "n".
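
The "search with every spelling" idea could look roughly like the sketch below. The table `CANDIDATES` and the function `search_variants` are hypothetical names, the entries are taken only from the examples in this thread, and lowercasing the input is an assumption made to keep the table small.

```python
from itertools import product

# Hypothetical candidate table built from the examples in this thread.
CANDIDATES = {
    "ä": ("a", "ae"),
    "ö": ("o", "oe"),
    "ü": ("u", "ue"),
    "ñ": ("n", "ni", "ny"),
}


def search_variants(name):
    """Return every ASCII spelling of `name`, in a stable order, without duplicates."""
    # Each character contributes either itself or its list of replacements.
    options = [CANDIDATES.get(ch, (ch,)) for ch in name.lower()]
    variants = []
    for combo in product(*options):
        variant = "".join(combo)
        if variant not in variants:
            variants.append(variant)
    return variants


print(search_variants("Björk"))  # ['bjork', 'bjoerk']
```

Each variant would then be sent as its own search query. The number of variants multiplies with every special character in the name, so a real implementation would probably need a cap.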

andrzejc (Contributor) commented

Most foreigners have no idea how to transliterate diacritics, so they'll write 'u' instead of 'ue'. People with the correct keyboard layout will, OTOH, correctly write 'ü' etc. That's why I think that, if any transliteration is done, it should be based on graphical similarity, not the actual phoneme. My approach to this issue would be to base the search not solely on the identity of the cleanName() result but rather on similarity, i.e. the Levenshtein distance between the strings, modified to count the distance between a letter and its diacritic variant as e.g. half of the normal value. This way the correct spelling of the band name 'nÄo' would score higher than the transliterated 'nao', but eventually both would hit. I already started some of this work in PR #2551; however, it currently only updates the transliteration, as a rework of the entire track/album/artist name matching is necessary.
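
A minimal sketch of the weighted distance described in this comment, under stated assumptions: it is not the code from #2551, the function names are made up for the example, the 0.5 weight is the "e.g. half" from the comment, and comparing case-insensitively is an extra assumption.

```python
import unicodedata


def base_letter(ch):
    """Strip combining marks, so "ä" and "a" share the same base letter."""
    decomposed = unicodedata.normalize("NFKD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def substitution_cost(a, b):
    """0 for identical characters, 0.5 if they differ only by a diacritic, else 1."""
    if a == b:
        return 0.0
    if base_letter(a) == base_letter(b):
        return 0.5
    return 1.0


def weighted_distance(s, t):
    """Levenshtein distance where diacritic-only substitutions cost half."""
    # Assumption: compare case-insensitively and on composed characters.
    s = unicodedata.normalize("NFC", s).casefold()
    t = unicodedata.normalize("NFC", t).casefold()
    prev = [float(j) for j in range(len(t) + 1)]
    for i, a in enumerate(s, start=1):
        curr = [float(i)]
        for j, b in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1.0,                           # delete a
                curr[j - 1] + 1.0,                       # insert b
                prev[j - 1] + substitution_cost(a, b),   # substitute a -> b
            ))
        prev = curr
    return prev[-1]


print(weighted_distance("nÄo", "nao"))  # 0.5
```

With this scoring an exact match has distance 0, the diacritic-free spelling has distance 0.5, and an unrelated one-letter change such as 'neo' has distance 1, so candidates can be ranked instead of being accepted or rejected on identity alone.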
