
remove_accents option does not work properly anymore #444

Closed
alainv62 opened this issue Jan 8, 2023 · 8 comments · Fixed by #464

Comments

@alainv62

alainv62 commented Jan 8, 2023

Since the unidecode library was removed and replaced with the unicodedata module in commit 4d517d1, the remove_accents option no longer works properly.
E.g., in French, 'référence' is replaced with 'rfrence'.
Normalization form KC (NFKC) seems to be responsible, since form KD (NFKD) works fine with this example ('référence' is properly replaced with 'reference').
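
To illustrate the difference, here is a minimal sketch. It assumes the option normalizes the text and then drops every non-ASCII character; `strip_to_ascii` is a hypothetical helper name, not the project's actual function. NFKC keeps 'é' as one precomposed code point, which is then discarded, while NFKD splits it into 'e' plus a combining accent, so only the accent is discarded.

```python
import unicodedata


def strip_to_ascii(text: str, form: str) -> str:
    """Normalize with the given form, then drop every non-ASCII code point.

    Hypothetical helper for illustration only.
    """
    normalized = unicodedata.normalize(form, text)
    return normalized.encode("ascii", "ignore").decode("ascii")


word = "r\u00e9f\u00e9rence"  # 'référence' with precomposed 'é' (U+00E9)

# NFKC leaves 'é' composed, so the whole letter is lost.
print(strip_to_ascii(word, "NFKC"))  # rfrence

# NFKD decomposes 'é' into 'e' + U+0301, so only the accent is lost.
print(strip_to_ascii(word, "NFKD"))  # reference
```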

@esmeraldas63
Contributor

I have encountered the same issue with Lithuanian and Latvian letters too.

@bosd
Collaborator

bosd commented Jan 13, 2023

We need to fix this! I'm currently travelling, so I don't have much time to attend to it.
Does anyone have a suggestion for an alternative implementation or library?

@alainv62
Author

Why not just revert to the former solution, optimized_str = unidecode(optimized_str), which was working fine?

@bosd
Collaborator

bosd commented Jan 14, 2023

> Why not just revert to the former solution?

Because of the license issue mentioned earlier.

There are some good alternatives, such as built-ins or other libraries (though adding extra dependencies is not preferred).

@alainv62
Author

What about using normal form KD (NFKD)?

@rmilecki
Collaborator

> Why not just revert to the former solution optimized_str = unidecode(optimized_str) which was working fine?

See the description in the pull request that introduced that change: #436

And for the original report: #435

@bosd
Collaborator

bosd commented Feb 5, 2023

> What about using the normal form KD?

Yes, that seems to fix it.
A small test:

>>> import unicodedata
>>> str1 = "référence"
>>> unicodedata.normalize('NFKD', str1).encode('ascii', 'ignore').decode('ascii')
'reference'
>>> str2 = "ä"
>>> unicodedata.normalize('NFKD', str2).encode('ascii', 'ignore').decode('ascii')
'a'
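
One caveat with the encode/ignore approach, sketched below: letters with no Unicode decomposition (for example 'Ł', U+0141) are not split by NFKD, so they get dropped entirely rather than transliterated. A possible variant removes only combining marks and leaves everything else untouched. Both function names are hypothetical, and neither reproduces what unidecode did.

```python
import unicodedata


def ascii_only(text: str) -> str:
    """NFKD-normalize, then drop every non-ASCII code point (as in the test above)."""
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")


def strip_combining_marks(text: str) -> str:
    """NFKD-normalize, then drop only combining marks (hypothetical variant)."""
    decomposed = unicodedata.normalize("NFKD", text)
    # unicodedata.combining() returns 0 for characters that are not combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


city = "\u0141\u00f3d\u017a"  # 'Łódź'

print(ascii_only(city))             # odz  ('Ł' has no decomposition and is lost)
print(strip_combining_marks(city))  # Łodz ('Ł' is kept, accents are removed)
```

Which behaviour is preferable depends on whether the output must be pure ASCII or only accent-free.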

@bosd
Collaborator

bosd commented Feb 5, 2023

It is probably a good idea to add a test for the remove_accents function.
Is there an example invoice to add to the tests?
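
Until an example invoice is available, a string-level test could cover the cases reported in this thread. This is only a sketch: `remove_accents` here is a stand-in defined inside the example using the NFKD approach, not the project's real function, and the Lithuanian and Latvian samples are illustrative letter sequences rather than text from a real invoice.

```python
import unicodedata


def remove_accents(text: str) -> str:
    """Stand-in for the function under test (assumed NFKD-based implementation)."""
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")


def test_remove_accents() -> None:
    cases = {
        "r\u00e9f\u00e9rence": "reference",  # French 'référence'
        "\u00e4": "a",                        # 'ä'
        # Lithuanian letters: ą č ę ė į š ų ū ž
        "\u0105\u010d\u0119\u0117\u012f\u0161\u0173\u016b\u017e": "aceeisuuz",
        # Latvian letters: ā č ē ģ ī ķ ļ ņ š ū ž
        "\u0101\u010d\u0113\u0123\u012b\u0137\u013c\u0146\u0161\u016b\u017e": "acegiklnsuz",
    }
    for source, expected in cases.items():
        assert remove_accents(source) == expected, (source, expected)


if __name__ == "__main__":
    test_remove_accents()
    print("ok")
```

The function is named `test_*`, so pytest would collect it, and it also runs directly as a script.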
