Weird unicode characters #7

Open
devsorice opened this issue Nov 19, 2022 · 2 comments

@devsorice

Hi, just out of curiosity: I'm building a dictionary app and ran into this problem.
Some characters are duplicated in Unicode (I don't know how one would input them, but they have a different encoding than the ones produced by a standard Japanese keyboard).
When you search jisho.org with one of these alternative versions, you don't find anything.
I don't know whether this affects regular users, or whether it is standard practice to convert such characters (i.e. normalization, https://unicode.org/faq/normalization.html).

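Example 1
Search for the compatibility ideograph U+F90A: https://jisho.org/search/%EF%A4%8A No results
Search for the unified ideograph 金 (U+91D1): https://jisho.org/search/%E9%87%91 997 results

Example 2
Compatibility ideograph U+F902: https://jisho.org/search/%EF%A4%82 No results
Unified ideograph 車 (U+8ECA): https://jisho.org/search/%E8%BB%8A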
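
To make the difference between the two URLs in each example visible, here is a minimal Python sketch that decodes the percent-encoded path segments and prints their code points. The `describe` helper and the `SEGMENTS` table are made up for illustration; only the encoded segments come from the links above. It also checks that NFC normalization maps each compatibility ideograph onto its unified counterpart.

```python
import unicodedata
from urllib.parse import unquote

# Percent-encoded path segments copied from the example URLs above:
# compatibility form -> unified form.
SEGMENTS = {
    "%EF%A4%8A": "%E9%87%91",  # Example 1
    "%EF%A4%82": "%E8%BB%8A",  # Example 2
}

def describe(segment: str) -> str:
    """Decode a percent-encoded UTF-8 segment and list its code points (hypothetical helper)."""
    text = unquote(segment)
    return " ".join(f"U+{ord(ch):04X}" for ch in text)

for compat, unified in SEGMENTS.items():
    # NFC replaces a compatibility ideograph with its canonical (unified) equivalent.
    normalized = unicodedata.normalize("NFC", unquote(compat))
    print(describe(compat), "->", describe(unified), normalized == unquote(unified))
```

The two search terms look identical on screen but are different code points, which is why a search index keyed on the unified form would not match the compatibility form unless the query is normalized first.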

You can browse the full list
https://en.wikipedia.org/wiki/CJK_Compatibility_Ideographs

@Kimtaro
Owner

Kimtaro commented Nov 24, 2022

Hi @devsorice

I read up a bit on this in CJKV Information Processing (page 167+). Since these characters are subject to Unicode Normalization they might get automatically normalized by OS/browsers. So I did some testing on this to see if that normalization is applied before reaching Jisho.

Safari on macOS - normalized
Firefox on macOS - not normalized
Chrome on macOS - not normalized
Edge on Win11 (virtualized on macOS) - normalized
Chrome on Win11 (virtualized on macOS) - normalized

So it seems like in most cases the compatibility ideograph is being normalized to the unified version and thus gives search results as expected.

But since that is not the case for some OS/browser combinations (Firefox and Chrome on macOS in the tests above), I will consider adding a normalization step in the new version of Jisho that I'm working on.
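
For anyone wanting to try the same thing in their own app, a normalization step could look roughly like the Python sketch below. This is only an assumption about what such a step might be, not how Jisho is or will be implemented; `normalize_query` is a hypothetical name.

```python
import unicodedata

def normalize_query(query: str) -> str:
    """Hypothetical pre-search step: apply NFC before the query reaches the search index.

    CJK compatibility ideographs such as U+F90A have canonical decompositions,
    so NFC replaces them with the unified ideograph (U+91D1 in that case).
    """
    return unicodedata.normalize("NFC", query)

# Compatibility ideograph in, unified ideograph out (escapes used to avoid ambiguity).
print(normalize_query("\uf90a") == "\u91d1")
# Text that is already in unified form passes through unchanged.
print(normalize_query("\u91d1") == "\u91d1")
```

NFC is used here because it only applies canonical equivalences. NFKC would also handle these ideographs, but it additionally folds things like half-width katakana and full-width Latin letters, which may or may not be wanted in a dictionary search.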

Thanks for pointing this out. I hope this helps your development efforts as well :)

@Kimtaro
Owner

Kimtaro commented Nov 24, 2022

I'll add that I think I have gotten one other question about these compatibility ideographs over the close to 20 years of running Jisho, so I don't think it's a very common issue.
