Word Count in content structures does not count Chinese words properly #13796

Open
sandymcfadden opened this issue Feb 9, 2019 · 8 comments
Labels
Internationalization (i18n) Issues or PRs related to internationalization efforts [Package] Word count /packages/wordcount [Type] Bug An existing feature does not function as intended

Comments

@sandymcfadden

Describe the bug
When writing a post in Chinese, the word count shown in the content structure panel is not accurate.

To Reproduce
Steps to reproduce the behavior:

  1. Use http://generator.lorem-ipsum.info/_chinese to generate some sample Chinese text
  2. Create a new page in Gutenberg
  3. Paste in content
  4. Click the info icon at the top
  5. See an incorrect word count

Expected behavior
When using the same content in a word processor like Pages, the word count is significantly larger than the one shown in Gutenberg. The expected behavior is an accurate word count regardless of the language used.

Screenshots
You can see a large amount of content, but the count shows only 10 words.
screen shot 2019-02-09 at 7 41 03 am

Desktop (please complete the following information):

  • OS: macOS 10.14.2
  • Browser: Firefox
  • Version: 64
@jorgefilipecosta
Member

jorgefilipecosta commented Feb 11, 2019

It looks like this happens because in some languages words are not separated by spaces. For example, 这是鸟 means "This is a bird" and consists of 3 words without a single space character.
Counting words is a complex problem. For some languages, the best approach may be to count characters and apply a character-to-word ratio, but documents can mix languages, so we would need to identify the best method to use per segment.
The following external link describes an algorithm used to count words, which may be helpful for this issue: https://docs.sdl.com/LiveContent/content/en-US/SDL%20WorldServer-v3/GUID-376E123B-1C7E-4D64-82B0-1D33F088ABD5
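
To make the failure mode concrete, here is a minimal sketch of a whitespace-based count next to a ratio-based estimate. Both function names are hypothetical, and the default ratio of 1.5 characters per word is a placeholder assumption, not a measured value or anything taken from the wordcount package.

```javascript
// Naive approach: split on runs of whitespace.
// This only works for languages that separate words with spaces.
function naiveWordCount(text) {
  const trimmed = text.trim();
  return trimmed === '' ? 0 : trimmed.split(/\s+/).length;
}

// Hypothetical fallback: estimate words from the number of Han characters.
// charsPerWord is an assumed placeholder ratio, not a measured value.
function estimateWordsByRatio(text, charsPerWord = 1.5) {
  const hanChars = text.match(/\p{Script=Han}/gu) || [];
  return Math.round(hanChars.length / charsPerWord);
}

console.log(naiveWordCount('This is a bird')); // 4
console.log(naiveWordCount('这是鸟')); // 1, although the sentence has three words
console.log(estimateWordsByRatio('这是鸟')); // 2 with the assumed ratio of 1.5
```

The ratio gives only an estimate (2 instead of 3 here), which is why the choice of method per segment matters.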

@jorgefilipecosta jorgefilipecosta added the [Type] Bug An existing feature does not function as intended label Feb 11, 2019
@Jackie6
Contributor

Jackie6 commented Feb 27, 2019

@jorgefilipecosta I think there are two ways to fix this bug. One way is to follow atom-word-counter and present both the word count (based on English words) and the character count (all characters excluding white space). This requires only small changes in the UI.
The other way is to follow Microsoft Word: when counting words in sentences that mix East Asian and Latin-script languages, such as "Hello 你好", the result is three words (one English word plus two Chinese characters). This requires a significant change in the count function, especially matchWords.
Which one is better? Does anyone have any ideas?
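
For illustration, the second option could be sketched roughly as follows. This is not the existing matchWords implementation; `countMixedWords` and `TOKEN` are hypothetical names, and treating Han, Hiragana, and Katakana characters as one word each is an assumption made for this sketch.

```javascript
// A token is either one CJK character or a run of non-space, non-CJK characters.
// Which scripts count "per character" is an assumption of this sketch.
const TOKEN =
  /[\p{Script=Han}\p{Script=Hiragana}\p{Script=Katakana}]|[^\s\p{Script=Han}\p{Script=Hiragana}\p{Script=Katakana}]+/gu;

function countMixedWords(text) {
  const tokens = text.match(TOKEN) || [];
  // Drop tokens with no letter or digit, such as stray punctuation.
  return tokens.filter((token) => /[\p{L}\p{N}]/u.test(token)).length;
}

console.log(countMixedWords('Hello 你好')); // 3: one English word plus two Chinese characters
console.log(countMixedWords('你好，世界！')); // 4: the full-width punctuation is ignored
```

Counting each character as a word matches the "Hello 你好" example above, but it does not attempt real word segmentation for Chinese or Japanese.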

@jorgefilipecosta
Member

Thank you for summarizing and sharing your thoughts @Jackie6.
I am also not sure which option is better in a case like this, cc: @jasmussen, @mapk, @kjellr in case you have some thoughts on this.

@jorgefilipecosta jorgefilipecosta added Internationalization (i18n) Issues or PRs related to internationalization efforts Needs Design Needs design efforts. labels Mar 7, 2019
@jasmussen
Contributor

Great ticket. The two options presented appear to be the "easy" version (count words and characters) and the hard version (be aware of the language when counting words).

It seems like the latter is the better user experience, but it could be so difficult that unless we get solid pull requests it may take a while for this to appear. The former, by contrast, is probably easy to build, and a character count would likely be useful regardless of language.

Keeping in mind that we mean to merge the Document Outline tool with the Block Navigation tool, we could possibly build option 1 at the same time and then consider upgrading to option 2 at a later time?

@gziolo
Member

gziolo commented Mar 25, 2019

@sandymcfadden, #14589 has been opened with a proposal for resolving this issue as suggested in the discussion above.

@karmatosed karmatosed removed the Needs Design Needs design efforts. label May 2, 2019
@swissspidy
Member

Related: #24823 was merged, but it seems this issue is still relevant.

cc @david-szabo97

@david-szabo97
Member

@swissspidy Ugh, this is a difficult topic.
IMHO, the best approach would be to move this to the list view and do what Google Docs does: show Words, Characters, and Characters excluding spaces. Even though Words is not useful information for languages like Chinese or Japanese, I don't think we can accurately handle all languages. If we show all three variations, we can leave it to the writer to decide which information is useful to them.
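
A rough sketch of what computing the three figures could look like, assuming a plain-text input. `textStats` is a hypothetical helper, not an existing Gutenberg function, and the word figure here is a plain whitespace split.

```javascript
// Returns the three figures side by side so the writer can pick the useful one.
function textStats(text) {
  const trimmed = text.trim();
  // Array.from iterates by code point, so characters outside the BMP count once.
  const chars = Array.from(text);
  return {
    words: trimmed === '' ? 0 : trimmed.split(/\s+/).length,
    characters: chars.length,
    charactersExcludingSpaces: chars.filter((ch) => !/\s/.test(ch)).length,
  };
}

console.log(textStats('Hello 你好'));
// { words: 2, characters: 8, charactersExcludingSpaces: 7 }
```

For Chinese or Japanese text the word figure stays unreliable, but the two character figures remain meaningful.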


@talldan
Contributor

talldan commented Feb 1, 2023

There was a brief discussion about this on #43403, and one idea I had is to use the Unicode character ranges of the written content to determine how words or time to read are calculated.

I found a library that seems to use that approach - https://github.com/ngryman/reading-time.
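
As a sketch of the idea only (this is not the linked library's API), text could be split into runs by script and each run measured with its own rule. The function names and both default reading rates below are hypothetical placeholders.

```javascript
// Characters treated as CJK for this sketch (an assumption, not a complete list).
const CJK = /[\p{Script=Han}\p{Script=Hiragana}\p{Script=Katakana}]/u;

// Split text into consecutive runs that are either all CJK or all non-CJK.
function segmentByScript(text) {
  const runs = [];
  for (const ch of text) {
    const cjk = CJK.test(ch);
    const last = runs[runs.length - 1];
    if (last && last.cjk === cjk) {
      last.text += ch;
    } else {
      runs.push({ cjk, text: ch });
    }
  }
  return runs;
}

// Both rates are assumed placeholder values, not figures from any library.
function estimateReadingMinutes(text, wordsPerMinute = 200, cjkCharsPerMinute = 500) {
  let minutes = 0;
  for (const run of segmentByScript(text)) {
    if (run.cjk) {
      // CJK runs are measured per character.
      minutes += Array.from(run.text).length / cjkCharsPerMinute;
    } else {
      // Other runs are measured per whitespace-separated word containing a letter or digit.
      const words = run.text.split(/\s+/).filter((t) => /[\p{L}\p{N}]/u.test(t));
      minutes += words.length / wordsPerMinute;
    }
  }
  return minutes;
}

console.log(segmentByScript('Hello 你好').length); // 2 runs: 'Hello ' and '你好'
```

Measuring per run keeps mixed-language documents from being forced into a single counting rule.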

10 participants