Word Count in content structures does not count Chinese words properly #13796

sandymcfadden · 2019-02-09T12:17:55Z

Describe the bug
When writing a post in Chinese the word count shown in the content structure does not show an accurate word count.

To Reproduce
Steps to reproduce the behavior:

1. Use http://generator.lorem-ipsum.info/_chinese to generate some sample Chinese text
2. Create new page in Gutenberg
3. Paste in content
4. Click the info icon at the top
5. See an incorrect word count

Expected behavior
When the same content is pasted into a word processor like Pages, the word count is significantly larger than the one shown in Gutenberg. The expected behavior would be an accurate word count regardless of the language used.

Screenshots
You can see a large amount of content, but it is showing only 10 words.

Desktop (please complete the following information):

OS: MacOS 10.14.2
Browser: Firefox
Version: 64

jorgefilipecosta · 2019-02-11T17:42:37Z

It looks like this happens because in some languages words may not be separated by spaces. For example, 这是鸟 means "This is a bird"; it is 3 words without a single space character.
Counting words is a complex problem. In some languages, the best approach may be to count each character and apply a character-to-word ratio, but documents can mix languages, so we would need to identify the best method to use for each segment.
The following external link describes an algorithm used to count words and may be helpful for this issue: https://docs.sdl.com/LiveContent/content/en-US/SDL%20WorldServer-v3/GUID-376E123B-1C7E-4D64-82B0-1D33F088ABD5
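
To illustrate the cause described above, here is a minimal sketch of a whitespace-based counter. The function name `naiveWordCount` is hypothetical, and this is not the actual @wordpress/wordcount code; it only shows why text without spaces collapses into a single "word".

```typescript
// Naive whitespace-based counter, for illustration only.
// Assumption: this is a simplified stand-in, NOT the real package implementation.
function naiveWordCount(text: string): number {
  const trimmed = text.trim();
  if (trimmed === "") {
    return 0;
  }
  // Every run of whitespace is treated as a word boundary.
  return trimmed.split(/\s+/).length;
}

const englishCount = naiveWordCount("This is a bird"); // 4
const chineseCount = naiveWordCount("这是鸟"); // 1, although the sentence has 3 words
```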

Jackie6 · 2019-02-27T16:28:12Z

@jorgefilipecosta I think there may be two ways to fix this bug. One way is to follow atom-word-counter and present both the word count (based on English words) and the character count (all kinds of characters, excluding white space). This requires only small changes in the UI.

Another way is to follow MS Office Word: when counting words in sentences that mix East Asian and Latin languages, such as "Hello 你好", there are three words (one English word plus two Chinese characters). This requires a significant change in the count function, especially matchWords.
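
A rough sketch of this second option, under stated assumptions: the helper `countMixedWords` is hypothetical, only two Han ideograph ranges in the Basic Multilingual Plane are treated as "one character equals one word", and punctuation is not handled specially. It is not how matchWords currently works.

```typescript
// Assumption: only these two BMP ranges (CJK Extension A and CJK Unified
// Ideographs) are counted per character. Other scripts are not covered.
const HAN_CHARACTER = /[\u3400-\u4dbf\u4e00-\u9fff]/g;

// Hypothetical helper: each Han character counts as one word, and each
// whitespace-delimited run of the remaining text counts as one word.
function countMixedWords(text: string): number {
  const hanMatches = text.match(HAN_CHARACTER);
  const hanCount = hanMatches ? hanMatches.length : 0;

  // Replace Han characters with spaces so they also act as separators.
  const rest = text.replace(HAN_CHARACTER, " ").trim();
  const otherCount = rest === "" ? 0 : rest.split(/\s+/).length;

  return hanCount + otherCount;
}

const mixed = countMixedWords("Hello 你好"); // 3: one English word + two Chinese characters
```

Note that punctuation left over after removing the Han characters would be counted as a word in this sketch, so a real change would need to strip it first.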

Which one is better? Does anyone have any idea?

jorgefilipecosta · 2019-03-07T12:10:27Z

Thank you for summarizing and sharing your thoughts @Jackie6.
I am also not sure which option is better in a case like this, cc: @jasmussen, @mapk, @kjellr in case you have some thoughts on this.

jasmussen · 2019-03-07T14:32:13Z

Great ticket. The two options presented appear to be the "easy" version (count words and characters) and the "hard" version (be aware of the language when counting words).

It seems like the latter is the better user experience, but it could be so difficult that unless we get solid pull requests it may take a while for this to appear. Whereas for the former, it's probably both easy to build, and a character count could likely be useful regardless of language.

Keeping in mind we mean to merge the Document Outline tool with the Block Navigation tool, we could possibly build solution 1 at the same time, and then consider upgrading to version 2 at a later time?

gziolo · 2019-03-25T07:30:51Z

@sandymcfadden, there is #14589 opened with a proposal of how to resolve this issue as suggested in the discussion above.

swissspidy · 2020-08-31T14:14:42Z

Related: #24823 was merged, but it seems this issue is still relevant.

cc @david-szabo97

david-szabo97 · 2020-08-31T14:24:17Z

@swissspidy Ugh, this is a difficult topic.
IMHO, the best would be to move to the list view and do what Google Docs does.
Show Words, Characters, and Characters excluding spaces. Even though Words is not useful information for languages like Chinese or Japanese, I don't think we can accurately handle all the languages. If we show all three variations, then we can leave it to the writer to decide which information is useful to them.
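
As a minimal sketch of the three figures mentioned above (the `TextStats` shape and `textStats` function are hypothetical names, and words are assumed to be whitespace-delimited):

```typescript
// Hypothetical result shape for the three counts.
interface TextStats {
  words: number;
  characters: number;
  charactersExcludingSpaces: number;
}

function textStats(text: string): TextStats {
  const trimmed = text.trim();
  return {
    // Whitespace-delimited words; this undercounts languages without spaces.
    words: trimmed === "" ? 0 : trimmed.split(/\s+/).length,
    // String length counts UTF-16 code units, so characters outside the
    // Basic Multilingual Plane (e.g. many emoji) count as 2.
    characters: text.length,
    charactersExcludingSpaces: text.replace(/\s/g, "").length,
  };
}

const stats = textStats("Hello 你好");
// stats.words === 2, stats.characters === 8, stats.charactersExcludingSpaces === 7
```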

talldan · 2023-02-01T05:33:48Z

There was a brief discussion about this on #43403, and one idea I had is to use the Unicode character ranges of the written content to determine how words or time to read are calculated.

I found a library that seems to use that approach - https://github.com/ngryman/reading-time.
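
A hedged sketch of what a range-based time-to-read estimate could look like. Everything here is an assumption for illustration: the `ReadingRates` shape, the `estimateMinutes` function, and the two Han ranges are hypothetical, the rates are supplied by the caller rather than taken from any source, and this does not reproduce the API or internals of the library linked above.

```typescript
// Assumption: only these two BMP Han ranges are counted per character.
const HAN_RANGE = /[\u3400-\u4dbf\u4e00-\u9fff]/g;

// Hypothetical caller-supplied rates; no default values are implied.
interface ReadingRates {
  wordsPerMinute: number;
  hanCharsPerMinute: number;
}

function estimateMinutes(text: string, rates: ReadingRates): number {
  const hanMatches = text.match(HAN_RANGE);
  const hanCount = hanMatches ? hanMatches.length : 0;

  // Everything outside the Han ranges is counted as whitespace-delimited words.
  const rest = text.replace(HAN_RANGE, " ").trim();
  const wordCount = rest === "" ? 0 : rest.split(/\s+/).length;

  // Each portion of the text contributes time at its own rate.
  return wordCount / rates.wordsPerMinute + hanCount / rates.hanCharsPerMinute;
}
```

Passing the rates in keeps the per-language numbers out of the counting logic, so they can be tuned or localized separately.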

jorgefilipecosta added the [Type] Bug An existing feature does not function as intended label Feb 11, 2019

jorgefilipecosta added Internationalization (i18n) Issues or PRs related to internationalization efforts Needs Design Needs design efforts. labels Mar 7, 2019

Jackie6 mentioned this issue Mar 22, 2019

Add characters count in the table of contents #14589

Closed

7 tasks

karmatosed removed the Needs Design Needs design efforts. label May 2, 2019

skorasaurus added the [Package] Word count /packages/wordcount label Jan 24, 2022

skorasaurus mentioned this issue Jan 24, 2022

@wordpress/wordcount planned support for Chinese, Japanese, etc characters? #38186

Closed

skorasaurus mentioned this issue Jan 24, 2023

Inconsistent Excerpt Length in some alphabets #47383

Open

ntsekouras mentioned this issue Jan 31, 2023

[New Block] Add post time to read block #43403

Merged

t-hamano mentioned this issue Aug 17, 2023

Stabilize Time to Read block #53776

Open

tobifjellner mentioned this issue Apr 15, 2024

Is solution for AVERAGE_READING_RATE incomplete for i18n? #60741

Open
