Persian isn't supported. #42

Open
niyumard opened this issue Jan 8, 2020 · 11 comments
Labels: enhancement (New feature or request), help wanted (Extra attention is needed)

Comments

@niyumard

niyumard commented Jan 8, 2020

Hi, this is how Persian text is rendered by Stutter:
[screenshot]
[screenshot]

It seems Stutter doesn't support RTL languages and doesn't use a suitable font for them.

@jamestomasino
Owner

Hi @niyumard! Thanks for logging the issue. You're absolutely right. Not only that, but the way I break up word forms is specifically English-based. I don't have the knowledge to re-implement that in a way that would support other languages, let alone RTL ones.

That being said, it's about time I found a way to accept localization contributions. If I can make it easy for others to contribute their own language parts, we can start tackling this.

jamestomasino added the enhancement and help wanted labels Jan 8, 2020
jamestomasino self-assigned this Jan 8, 2020
@jamestomasino
Owner

While I haven't really made progress on Persian, I have added some basic locale support to Stutter. More work will be required to handle RTL, but if you have any LTR languages you want to work with, all you need to do is modify the JSON object at the top of parts.js. I found a list of common prefixes and suffixes for Spanish, and left all other word-splitting behavior the same as English. Hopefully others will submit PRs for other languages!

@niyumard
Author

niyumard commented Jan 9, 2020

I suggest letting users pick their own font; I think that might help.

I also tried changing "__stutter_right" to "__stutter_left" and it helps! There's still a problem, though, because Persian/Arabic script doesn't use block letters; it is cursive by nature.

@niyumard
Author

niyumard commented Jan 10, 2020

You may be able to solve the cursive problem by using this character: "ـ"
https://en.wikipedia.org/wiki/Kashida
For example, adding it to س gives سـ, which is perfect for the start or middle of a word, while س by itself is used at the end of a word.

@jamestomasino
Owner

Ahh, so I'll need to make my word divide character a configurable value in the JSON object as well. That's very good to know.

Other than the display being in the wrong direction, is Stutter reading through Persian text in the correct direction so that each word is in the correct order? If so, I think the steps needed to add support would be:

  • Add functionality to display RTL languages visually RTL
  • Add support for other fonts that can display the characters properly (possibly by using CSS variables and a user-defined string)
  • Add a custom word divide character
  • Modify the regex properties for the language to parse it correctly

Can you think of anything else?
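
To make the first three checklist items concrete, here is a minimal sketch of how per-locale display settings might be turned into CSS. Everything named here is an assumption for illustration: the `RTL_LANGUAGES` list, the `--stutter-font` variable, the `"-"` default divider, and `buildDisplayStyle` itself do not come from Stutter's code.

```javascript
// Sketch only: these names are hypothetical, not part of Stutter.
const RTL_LANGUAGES = new Set(["fa", "ar", "he", "ur"]);

// Returns the text direction, the word divide character, and a CSS
// declaration string that exposes the user's font as a CSS variable.
function buildDisplayStyle(lang, userFont, divider = "-") {
  const direction = RTL_LANGUAGES.has(lang) ? "rtl" : "ltr";
  const rules = [`direction: ${direction}`];
  if (userFont) {
    // Quote the family name so names containing spaces stay valid CSS.
    const quoted = userFont.replace(/["\\]/g, "\\$&");
    rules.push(`--stutter-font: "${quoted}"`);
  }
  // Fall back to a generic family when no user font is set.
  rules.push("font-family: var(--stutter-font, sans-serif)");
  return { direction, divider, cssText: rules.join("; ") + ";" };
}

// Persian: RTL, a user-chosen font, and the kashida as the divider.
const fa = buildDisplayStyle("fa", "Vazirmatn", "\u0640");
console.log(fa.cssText);
```

The returned `cssText` could be applied to whichever element displays the word; keeping the function free of DOM calls makes it easy to test outside the browser.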

@niyumard
Author

> Is Stutter reading through Persian text in the correct direction so that each word is in the correct order?

Yes, the order is right.

> can display the characters properly

The characters are displayed properly, but I'd rather see them in another font; this one is too ugly for Persian text. So maybe this isn't much of a priority, but if you can make it happen, that would be great.

> Can you think of anything else?

Not really. I'm not sure how Stutter divides words.

@jamestomasino
Owner

I've added more information to the README regarding localization, and I moved the locales content to its own JSON file. Persian will need more features than are currently available, but if you'd like to start creating a "fa" entry, that would be helpful. I assume the first regular expression will still work, since it just splits on whitespace. The second one, which splits on "." or ",", will probably need to change. Finally, the presub section will need a lot of love.

Collectively, that is the 4th item in the checklist above. I'll have to do 1-3 myself.
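
As a rough illustration of what the two regular expressions might look like for Persian, here is a sketch that adds the Persian comma (U+060C), semicolon (U+061B), and question mark (U+061F) to the punctuation split. The key names `wordSplit` and `pauseSplit` and the `tokenize` helper are made up for this example and may not match the real locale file.

```javascript
// Hypothetical "fa" entry: key names are illustrative only.
const faLocale = {
  // Unchanged from the English behavior: split on runs of whitespace.
  wordSplit: /\s+/u,
  // Latin "." and "," plus the Persian comma, semicolon, and question mark.
  // The capture group keeps each punctuation mark as its own token.
  pauseSplit: /([.,\u060C\u061B\u061F])/u,
};

// Split text into words, then split punctuation off each word.
function tokenize(text, locale) {
  return text
    .split(locale.wordSplit)
    .flatMap((word) => word.split(locale.pauseSplit))
    .filter((part) => part.length > 0);
}

console.log(tokenize("سلام، دنیا.", faLocale));
```

Whether punctuation should be kept as separate tokens or dropped depends on how Stutter uses the second expression, which this sketch does not try to reproduce.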

@niyumard
Author

niyumard commented Feb 19, 2020

Well, it seems I can't master regex at the moment. How about I write down the Persian alphabet and common prefixes here?

@jamestomasino
Owner

Let's start with that and see how it goes. :)

@niyumard
Author

Here is the Persian alphabet:

ا ب پ ت ث ج چ ح خ د ذ ر ز ژ س ش ص ض ط ظ ع غ ف ق ک گ ل م ن و ه ی
Some people also use these characters:
ك ء ة آ إ ي ئ ؤ

Complex words may be separated in several ways. The correct way is with a zero-width non-joiner, but some people separate the parts with a space and some don't use any separator. For example:
Correct form for a prefix:

می‌خواهم

but people also use:

می خواهم

and

میخواهم

correct form for a suffix:

کتاب‌ها

but people also use:

کتاب ها

or

کتابها

so here are some prefixes:

می

and here are some suffixes:

ها های تر ترین کده گان گانه گر وار ستان

Any time there's a zero-width non-joiner, you can safely split that word into two parts, although they obviously belong together. For example:

کم‌محبت = کم + محبت

I hope that it helps!
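
A rough sketch of how those rules could be applied: split on the zero-width non-joiner when it is present, otherwise try the prefix and suffix lists above. The `splitParts` function and its `minStem` guard are assumptions for illustration, not Stutter's actual word-splitting code, and plain string matching like this will still mis-split some words that merely start or end with the same letters.

```javascript
const ZWNJ = "\u200C"; // zero-width non-joiner
const PREFIXES = ["می"];
// Longest first, so that "ترین" is tried before "تر" and "های" before "ها".
const SUFFIXES = ["ها", "های", "تر", "ترین", "کده", "گان", "گانه", "گر", "وار", "ستان"]
  .sort((a, b) => b.length - a.length);

// Returns the parts of a single word. minStem is the minimum number of
// letters that must remain after removing a prefix or suffix.
function splitParts(word, minStem = 2) {
  if (word.includes(ZWNJ)) {
    return word.split(ZWNJ).filter(Boolean);
  }
  const parts = [];
  let stem = word;
  const prefix = PREFIXES.find(
    (p) => stem.startsWith(p) && stem.length - p.length >= minStem
  );
  if (prefix) {
    parts.push(prefix);
    stem = stem.slice(prefix.length);
  }
  const suffix = SUFFIXES.find(
    (s) => stem.endsWith(s) && stem.length - s.length >= minStem
  );
  if (suffix) {
    parts.push(stem.slice(0, -suffix.length), suffix);
  } else {
    parts.push(stem);
  }
  return parts;
}

console.log(splitParts("می‌خواهم"), splitParts("کتابها"), splitParts("میز"));
```

The space-separated variants (می خواهم, کتاب ها) arrive as two separate words after whitespace splitting, so they would have to be rejoined before this step if they are meant to be shown together.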

@niyumard
Author

niyumard commented Feb 6, 2021

I think I've found the main problem. It seems that separating words down to letters (or a group of letters) isn't a good idea for cursive scripts in which the letters change shape according to their position in the word. When the extension tries to make one letter red, it does so by separating that single letter, so it gets separated and is shown in the wrong way. In Persian and languages with Arabic script in general, the letters change shape according to their adjacent letters.

For example in the word کتاب, the letter ت becomes ـتـ when it's medial and surrounded with certain other letters.
The same thing goes for other letters as well. They change shape according to their position in the word as mentioned in the wiki.

What we need is to introduce kashida in Stutter.
So the solution for the letter ت is this: if it's isolated, it doesn't need any kashida.
If it's the first letter, it needs to be connected to the next letter, and in that case the browser itself renders it correctly. If you copy تا and remove the first character, you can see what happens.
If we want to separate it, though, we need one kashida: تـ‌ا
If it's in the middle, it needs two kashida characters, one before and one after it: ـتـ
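
The rule above could be sketched like this. The letter groups follow the usual joining behavior of the Arabic script (some letters connect on both sides, some only to the letter before them), but `isolateLetter` and the set names are hypothetical and not part of Stutter. The sketch ignores diacritics and only covers the letters listed earlier in this thread.

```javascript
const TATWEEL = "\u0640"; // kashida

// Letters that connect to both the previous and the next letter.
const DUAL_JOINING = new Set("بپتثجچحخسشصضطظعغفقکگلمنهیكيئ");
// Letters that connect only to the previous letter.
const RIGHT_JOINING = new Set("اآأإدذرزژوؤة");

function joinsToPrevious(ch) {
  return DUAL_JOINING.has(ch) || RIGHT_JOINING.has(ch);
}

// Returns the letter at `index`, padded with kashida on each side where it
// would have been connected to a neighbor inside the word.
function isolateLetter(word, index) {
  const letters = Array.from(word);
  const ch = letters[index];
  const prev = letters[index - 1];
  const next = letters[index + 1];
  const before = DUAL_JOINING.has(prev) && joinsToPrevious(ch) ? TATWEEL : "";
  const after = DUAL_JOINING.has(ch) && joinsToPrevious(next) ? TATWEEL : "";
  return before + ch + after;
}

// Each letter of کتاب, shaped as it appears inside the word.
console.log(Array.from("کتاب").map((_, i) => isolateLetter("کتاب", i)));
```

With this approach the highlighted letter keeps its in-word shape even after it is wrapped in its own element, because the kashida gives the browser something to join it to.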

2 participants